FOREWORD


Achievement testing is an essential phase of the educational process.
It shows the extent of success achieved against previously determined
targets, stated in the form of intended learning outcomes. Traditional
norm-referenced testing is directed towards the relative performance of
students and judgements of deviation. It provides a lop-sided view of
pupils' learning, as it disregards the need for assessing achievement in
terms of absolute standards of performance, usually stated in terms of
intended learning outcomes.


It is in this context that the movement for Criterion-Referenced
testing gained ground in 1963, when Robert Glaser made a distinction
between Norm-Referenced and Criterion-Referenced measurement
strategies. Emphasis on individualised instruction, programmed
learning, behavioural definitions and other such educational
contributions provided the technical support to encourage the
Criterion-Referenced approach to measurement. Evaluators are now
convinced that such devices of assessment can play a more prominent
role in improving students’ performance, quality of instruction and
evaluation efforts.


In India, work on Criterion-Referenced testing is still in its
infancy. There is a dire need for basic reading material which may
provide theoretical constructs, material development strategies and
guide posts related to the construction and validation of
Criterion-Referenced tests. The present compendium of papers on
Criterion-Referenced measurement is a good addition to the very
limited literature in this field. The papers from various
contributors cover a large ground in this area and provide a very good
basis for evaluators to gain insight into the Criterion-Referenced
approach to measurement.


I congratulate Pritam Singh and Kamla Menon, who
painstakingly worked on this project to bring out this publication at a
time when the National Education Policy lays emphasis on the use of
evaluation as a diagnostic device for improvement of students' learning
and quality of instruction.


I am sure this volume will engender a lot of interest among
teachers and evaluators alike in using the Criterion-Referenced approach
for integrating teaching and testing. Such progress diagnostic tests
would go a long way in diagnosing students’ inadequacies in learning
and provide the basis for remedial measures to bring most of the
students to the expected level of mastery in their attainments.
Suggestions and observations from the readers are most welcome.


Date: 29.7.89 Dr. K. Gopalan 
Director 
N.C.E.R.T. 


PREFACE


The growing concern of both teachers and educational
administrators to ensure improvement in the quantum and quality of
learning has focussed attention on improving evaluation practices and
procedures, with a view to obtaining appropriate evidence and using
it to bring about desired changes in the content and process of
education. It is this emphasis on teaching-learning that has made
Criterion-Referenced Testing a viable alternative to the existing practice
of grading, classification and certification of students on the basis of
their performance and proficiency in testing situations.


The use of the Criterion-Referenced methodology of evaluation
has come to be recognised as more relevant to Indian education today
than ever before. The growing rate of school drop-outs alongside the
expansion of schooling facilities does indeed present a paradoxical
situation in our context. The problems arising out of this can, to some
extent, be tackled by using intended learning or expected performance
standards as the yardstick for judging students’ learning outcomes, rather
than pupils’ actual achievement on the usual teacher-made tests.


It was this that motivated the Department of Measurement,
Evaluation, Survey and Data Processing to initiate a project on
Criterion-Referenced Testing in 1985. As a part of this project, various
sources of the work done in this area were consulted and the expert
opinion available in the country was tapped to evolve an approach to
Criterion-Referenced Evaluation as a preferred mode of evaluation in
comparison to the traditional one.


The present document is a compendium of papers presented in
the course of a seminar held in Delhi in September 1985 and contains
the views of eminent teachers, educators and researchers of the
country on Criterion-Referenced Evaluation, a theme on which not
much work has so far been done in India. These papers focus the
attention of teachers, researchers and teacher educators on this
subject of vital importance and are expected to motivate them to evolve
useful test-development strategies using the Criterion-Referenced
Approach. The papers have been classified into three sections dealing
with the conceptual framework, the development of Criterion-Referenced
tests and the problems of measurement. The papers deal
with very specific and fundamental issues related to each of these
aspects and incorporate some research findings as well. Since work in
this field in India is still at the stage of infancy, a few gaps have no
doubt remained in these readings.


This project on the preparation of readings on Criterion-Referenced
Evaluation was undertaken by Pritam Singh, formerly Professor in this
department, in collaboration with Kamla Menon. In the complementary
project on the development of Criterion-Referenced Tests, which is reported
in this compendium, J.P. Shourie was also involved intensively. All of
them deserve my appreciation for projecting the emergent trend of the
Criterion-Referenced Approach to testing as an alternative to traditional
norm-referenced testing. It is my hope that this volume will provide
valuable reading material to the readers and focus the attention of both
teachers and evaluators on the importance of diagnostic evaluation and
remedial instruction for achieving the intended level of educational
standards.


Dr. H.S. SRIVASTAVA
Dean (Academic)
Department of Measurement, Evaluation,
Survey and Data Processing,
N.C.E.R.T.

New Delhi, 1989


Introduction


Testing should enable the improvement of students' learning and not be
merely restricted to classifying learners on the basis of their achievement.
This indeed is the motto for today's evaluators and educational
administrators. Fortunately, teachers too are sympathetic to the “failures”
and “slow learners” and wish to be able to correct or “remedy” learning
deficiencies and related problems. It is in this situation that the mastery
learning model of Bloom and his associates, as well as the concept of
criterion based testing, has attracted the attention of Indian educationists,
although it has yet to catch the fancy of the classroom teacher.

The Programme of Action stated with regard to the National Policy on
Education emphasises the need for laying down the levels of attainment at
classes V, VII, VIII, X and XII, so that continuous comprehensive
evaluation of pupils’ scholastic and non-scholastic development can be
directed accordingly. It is in such a situation that criterion referenced
measurement becomes relevant.

The Indian experience has so far remained mainly a research area, and
experiments have been confined mainly to the methodology of test
construction and the validation of specific tests in languages and
mathematics. The papers presented here reflect an attempt not only to
further the debate on the need for criterion referenced testing but also
to highlight the issues involved in test validation and interpretation of
scores.

Section I is devoted to the conceptual debate in the field of testing on
whether criterion referenced tests measure yet another kind of achievement
or achievement against a previously determined criterion. This issue is
predominant in the papers of Banerjee, Reddy, Lalithamma and Raizada, in
which arguments are proposed in favour of criterion referenced measurement
as a means to improve learning, and its limited use for prediction or
selection is noted. There is an appeal for using both criterion referenced
and norm-referenced tests in schools, the former for diagnosis and the
latter for comparison of students. The utility of CR tests for vocational
education and certification of competencies for work is emphasised. The
advantage of the criterion referenced test is its usefulness as a progress
diagnostic test. Improvement of performance standards is regarded as an
important consideration if these tests are to offer an alternative to the
present external examination system, particularly for non-formal education.

There are still several aspects of CR Tests that overlap with, and vary
only marginally from, those of norm-referenced tests. These relate to the
coverage of content and abilities, forms of questions and performance
levels expected. The distinctiveness of CR Tests lies in the interpretation
of the results and, therefore, in the assumptions underlying the
maintenance of test quality and performance standards.

The evaluation of exceptional children has always involved the use of
criterion based measures. Further, the paper rightly emphasises that the
determination and statement of a criterion acts as a standard both for the
teacher and the taught. The utility of criterion referencing in curriculum
evaluation is proposed by Brahadeeswarn and Ramachandra Chari.
Their paper emphasises that the effectiveness of identified goals and
outcomes of the curriculum can best be judged when the level at which
these goals are to be achieved is determined, the curriculum transacted
and the achievement evaluated on a mastery/non-mastery basis. By such
an analysis the objectives themselves are evaluated and the
instructional strategy assessed.

There is also an important role for CR Tests as a part of the mastery
learning teaching-testing model. Discussed systematically both
theoretically (Khader) and practically (Srivastava), its relevance and
need become amply clear in these papers. The learner, when given
instruction for achievement of a predetermined standard, has to achieve it
completely; otherwise instructional inadequacy or the learner's deficiencies
are identified as causes, followed by remedial work to achieve the
standard expected, thereby leading to improvement and progress.
Programmed learning materials help construct hierarchies of content
and abilities and could be further supported with tests based on
intended performance standards if these are pre-defined. Srivastava
explains the actual use of CRT in counselling and remedial programmes
at the Bureau of Psychology, Allahabad.


There are still several problems linked to CR Testing regarding the
extent to which mastery of a concept can be measured without reference to
a group. Among others, the unresolved issues listed by Harikesh Singh are
non-linearity of the scores, identification of the validity of the criteria and
the underlying factors, besides the method of determining validity.

The second section includes papers related to test construction. In
criterion-referenced test construction, the steps for tryout and refinement
of tests are issues still open to debate and verification. Anand Bushan
has described the techniques of item generation which are based on
stimulus homogeneity and response homogeneity, while the steps in test
construction are listed by D.J. Modi. The steps listed include identifying
the test group, defining the domains and writing the tests. These are
identified as prose learning, item forms, mapping sentence, domain
based concept testing, LOGIQ and IQI. The steps in the construction of a
test involve assessing content relevance by review, both logical and
empirical, using the pre-post instruction, discrimination and difficulty
indices etc. Techniques of item selection and standard setting are listed.
Finding validity using instructional sensitivity, analysis of objective
based-ness and prediction are discussed, besides the reliability
measures. Finally, the contents of the manual and its utility are also
discussed.

In fact each step hitherto discussed has specific problems, as
explained in Singh's paper on defining the domain. Its relevance in our
situation, given the existing awareness of the behavioural outcomes
model, is established, and the method of defining domains is clearly
illustrated with examples. The models recommended are the single act,
closed domain and open domain methods. In each of these, intended learning
outcomes for each element of content are identified and the validity
established.

Item generation using the item form, mapping sentence, concept
analysis and structural approaches, discussed by Rathod, is effectively
illustrated here. Based on a general description of stimulus and response
characteristics, the item generation technique permits the estimation
of a sample of items with definite abilities to which items can be added
or modified. In this method it is assumed that the test is unidimensional,
the items are locally independent and each item has a characteristic curve
when item score is regressed on the person's ability. To find the cut-off
score and the master/non-master score, the Rasch model is proposed as
being viable.
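
As a rough illustration of the assumptions just listed, the item characteristic curve of the one-parameter (Rasch) model can be sketched in Python. The function names, the item difficulties and the bisection search for the ability that corresponds to a chosen cut-off score are assumptions made for this sketch only; they are not taken from Rathod's paper.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model.

    Both arguments are on the same logit scale; the curve depends only
    on their difference, which is what makes the model one-parameter.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, difficulties):
    """Expected number of items answered correctly at a given ability."""
    return sum(rasch_probability(ability, b) for b in difficulties)


def ability_for_cutoff(cut_score, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Ability at which the expected score equals cut_score (bisection).

    The search bounds lo and hi are illustrative defaults.
    """
    if not 0 < cut_score < len(difficulties):
        raise ValueError("cut_score must lie strictly between 0 and the test length")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < cut_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item difficulties for a four-item test.
difficulties = [-1.0, -0.5, 0.5, 1.0]
cut_ability = ability_for_cutoff(3.0, difficulties)
print(round(cut_ability, 3))
```

A raw cut-off of 3 out of 4 is thus translated into a point on the ability scale; a student whose estimated ability lies above that point would be classified as a master under these assumptions.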

Each step in the development of criterion referenced tests has 
distinct alternatives. The construction of such tests is still at the stage of 
infancy in India and there is ample scope for investigation in this area. 

As with Norm-Referenced Tests, the quality of Criterion-Referenced Tests
depends on the quality of the test items as well as the quality of the
test as a whole. Therefore, the establishment of validity and reliability
of Criterion-Referenced Tests is as important as in the case of
Norm-Referenced Tests. However, the connotation of validity and reliability
needs to be understood in the context of criterion behaviours regarded as
intended outcomes of learning, rather than in terms of variance, which is
the basic assumption in the case of Norm-Referenced Tests. Whereas
interpretation of Norm-Referenced Tests is made in terms of variance in the
group performance, the focus of interpretation in the case of
Criterion-Referenced Tests is on determining the mastery level achieved by
students in terms of pre-determined performance criteria regarded as
intended outcomes of learning. Thus, performance measures in
Criterion-Referenced Testing reflect an individual's performance related to
teaching effectiveness vis-a-vis the level of intended achievement, while in
norm-referenced measurement they reflect the performance of a student in
terms of deviation from the group performance.

Dhaliwal in his paper discusses some fundamental issues relating to
measurement which have a bearing on the concepts of validity and reliability.
One such issue concerns the additivity of marks based on different
questions included in an achievement test. Whether it is desirable to add
marks pertaining to different questions in an achievement test,
norm-referenced or criterion referenced, is discussed in detail. The place
of absolute zero and the condition of equal-appearing intervals, with their
bearing on measurement, are explained. Another issue, pertaining to the role
of measurement, is discussed in relation to achievement tests. The author
also questions the justification for using statistical concepts like mean,
standard deviation, variance etc. in the case of norm-referenced achievement
tests. Validity, when looked at in the context of the fundamental issues of
criterion referenced results, is also considered operationally, and the need
for coverage of all independently examinable units of knowledge pertaining
to a prescribed course is discussed. Three kinds of validation
strategies, the descriptive, functional and domain selection validity, are
examined in the paper.

Ved Prakash in his paper deals with the evaluation of the quality of
Criterion Referenced Test items. Rating by content specialists is discussed
to highlight the need for judgemental validity. Empirical approaches for
judging the quality of the test items are discussed with reference to the
facility values and discrimination indices. Techniques like the upper-lower
index, pre-test/post-test differences, the Chi-square index and the
master/non-master index are explained. The concepts of facility index and
discrimination index as used in Criterion Referenced Tests are
differentiated from those used in Norm-Referenced Tests.
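
Two of the indices named above can be sketched in a few lines of Python, taking the facility value of an item as the proportion of students answering it correctly. The function names and the response data are hypothetical, and the definitions follow the common textbook forms of these indices rather than any formula quoted from Ved Prakash's paper.

```python
def facility_value(responses):
    """Proportion of correct responses to one item (1 = correct, 0 = wrong)."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(responses) / len(responses)


def pre_post_difference_index(pre, post):
    """Facility after instruction minus facility before instruction.

    A value near +1 suggests the item is sensitive to instruction;
    a value near 0 suggests it is not.
    """
    return facility_value(post) - facility_value(pre)


def master_nonmaster_index(masters, nonmasters):
    """Facility among masters minus facility among non-masters."""
    return facility_value(masters) - facility_value(nonmasters)


# Hypothetical responses of four students to a single item.
pre = [0, 0, 1, 0]    # before instruction
post = [1, 1, 1, 0]   # after instruction
print(pre_post_difference_index(pre, post))
```

In both indices the comparison is between groups defined by instruction or mastery status, not between high and low scorers on the total test as in the norm-referenced discrimination index.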


The paper by Chandrakant Bhoghayata presents an algorithm for the
measurement of uni-dimensionality of Criterion Referenced Test items of
a behavioural or content domain. Application of this algorithm is
illustrated by a hypothetical example. The meaning, purpose and
mathematical tools of graph theory are also discussed briefly. This
graph-theoretic algorithm is presented as a better alternative to factor
analysis for the empirical test of the uni-dimensionality assumption, and
the paper gives three uses of the algorithm for Criterion Referenced
Measurement.

A paper by Ram Chandrachar and Brahadeeswaram discusses the
establishment of reliability of Criterion Referenced Tests. Different
approaches for estimating the reliability of Criterion Referenced Tests
have been discussed. Three major categories of reliability appropriate to
Criterion Referenced Tests are mentioned, namely, reliability of
Criterion Referenced Test scores, reliability of domain score estimates
and reliability of mastery classification decisions. The paper describes
procedures that can be easily used by teachers for estimating the
reliability of Criterion Referenced Tests on the basis of the data obtained
from a single administration of the test.


The paper by Bhoghayata gives a review of research on Criterion
Referenced Measurement. Although Criterion Referenced Measurement is
still an educational innovation in India, in developed countries like the
USA and the UK there has been considerable research, which can be
attributed to many factors. Developments in measurement theory and
unresolved issues in this area have led to the recent proliferation of
research on Criterion Referenced Measurement.


Rathod, in his paper on the application of item response theory to
Criterion Referenced Tests, makes a comparison of the theories.
Limitations of classical test theory and generalisability theory are
highlighted. Advantages of item response theory are discussed, and the
procedure for Criterion Referenced Test and mastery test construction, as
well as validation using the Rasch model, is presented.


A theoretical paper on the decision-theoretic approach to Criterion
Referenced Tests by R.K. Mathur outlines some appropriate statistical
methods useful for classifying masters and non-masters in the sequence of
formative evaluation. Discussion centres on contributions to Criterion
Referenced Testing in the area of definition and terminology for allocation
of students to mastery states and estimation of domain scores.


Singh, Kamla and Shourie have conducted an empirical study on the
development of CRT at the primary stage. The results indicate the actual
mastery level in selected public schools and corporation schools of Delhi.
Interestingly, the findings of the study indicate that the number of
masters is far greater among students from corporation schools than among
those from public schools.


Dr. PRITAM SINGH 
10 May, 1990. 


SECTION — 1 
CONCEPTUAL FRAMEWORK 


Criterion Referenced Measurement: A Perspective


Tapan Banerjee 


ABSTRACT 
Criterion referenced measurement is one of the important trends
in the modern educational system. Here greater emphasis is placed [...]
performance, rather than comparing his performance with an outside
norm. One of the essential features of the CRT is to examine [...]


The goal of Education is to enable the individual to acquire [...]
skills, habits, attitudes and values. To help our students achieve [...]
competencies, rather than on knowledge, e.g. one of the competencies listed
for class one or two for language teaching is that the child be able to read
meaningfully different combinations of letters as words. Evaluation will be
done not merely for the purpose of ranks and grades, e.g. to identify the
student who has come first, second and so on, but with respect
to the learner himself, i.e. his performance will be compared not with others’
performances but with his own previous performances, in order to see the
improvement in his learning. Criterion Referenced Testing is based on the
above-mentioned concept of evaluation. It is obvious that in a class students
differ in their abilities. It is not possible for every student to learn at the
same speed in the classroom situation. There may be slow learners or students
with other learning disabilities arising from some environmental deprivation.
If their performances are compared with a ‘norm’, as is done in ‘Norm
Referenced Testing’ (NRT), many students may fall short of this
‘norm’. This experience of failure will have a demoralising effect on their
achievement. It will lower their self-esteem. But if, on the other hand, a
student’s performance is compared with his previous performance and his
improvement is shown to him, it will serve as a big motivating force. This
experience of success will lead him to future success. Criterion Referenced
Testing (CRT) embodies the above-mentioned philosophy.

Different specific learning outcomes will be formulated which will serve
as criteria of accepted standard performance. Thus Criterion Referenced
Measurements relate to criterion behaviour. So the purpose of CRT is to
ascertain an individual's position with respect to a well-defined behaviour
domain. A clear-cut idea of ‘domain’ is of utmost importance. A domain, so far
as content is concerned, means a segment of knowledge. But a behavioural
domain can be described as the relevant learners' behaviour associated with
an area of knowledge. Thus CRT items or questions are to be developed both
on the content area and on the desired behaviour associated with the content.
The success of the CRT depends on our ability to relate these two aspects. In
the first case the items are mainly educational in nature and in the second
case the items are psychological in nature. Needless to mention in the present
context, there is a lot of overlap between the two types of items or
questions, and it is difficult to separate educational items from
psychological items, as in the educational items a lot of psychological
principles are involved. It is just a question of emphasis. It is generally
agreed that it is the ‘sign-metric’ with which CRT or CRM is mainly related.
Other kinds of measurement scales (or metrics), viz. rating metric, accuracy
metric, proportion metric, scaling metric etc., are also closely connected
with Criterion Referenced Measurement.

Now the basic idea behind ‘mastery learning’ is that if proper facilities are
provided to all the students, they are expected to achieve the desired level of
‘mastery’. But the question which automatically follows is: what is the desired
level of ‘mastery’? To identify the criterion or standard of ‘mastery’
is one of the basic problems to be solved in the field of CRT. It would
be quite unjust to stick to an absolute score for determining the criterion of
‘mastery’ where there is so much variation amongst the students, in both
cognitive and non-cognitive spheres. Thus it should be a ‘range of
performance’ which should be the criterion of ‘mastery’ and not a particular
score.

As emphasised today, the ‘competencies’ of the learners are generally not
properly evaluated; only their acquisition of information or knowledge is
evaluated. Since knowledge is one of the means to develop competencies,
the major object of evaluation, as mentioned earlier, should be to examine
whether the desired competencies are attained or not. The importance of
attaining competencies is felt more emphatically today with the rapid
growth of knowledge. Since in the modern ‘Learner Centred Approach to
Education’ of the National Education Policy greater emphasis is placed on
‘self-learning’, future-oriented education emphasises basic competencies.
In the CRT, evaluation is based on the assessment of these generalised
aspects of knowledge. So the items in Criterion Referenced Measurement need
not be in the same format as taught in the classroom situation. Another point
which deserves special attention is that though in the CRT we generally
compare a student's performance with his own performance, a satisfactory
standard is to be determined with which such comparison may be made. This
will enable us to examine whether the desired level is reached or not.

It has already been mentioned that both intellectual and non-intellectual
development are within the scope of CRT. The affective and psychomotor
domains, interests, attitudes, values etc. are all within the scope of the CRT.
Here also we must seek a satisfactory criterion in each case. One
important consideration to be taken into account is that, though
there is a difference in the approaches, there is sufficient overlap between
CRM and NRM.

Validation study is one of the important dimensions which deserves
attention in the field of CRT. Since the evaluation of the present education
system is competency oriented, there is a sufficient role for some external
criterion in the CRT too. For an effective validation study of a test,
especially if the test is designed to meet a specific prediction problem,
the first goal should be the achievement of a good criterion.
It is essential that sufficient time be spent on the development of the
criterion. When one is aiming at factorial validity, however, the goal is to
develop tests for general purposes, and an immediate validation against
practical criteria is not essential. Whenever a worker in psychology or
education desires to measure some quality in a group or individual, he
faces the problem of choosing the best instrument for his purpose.


Generally there will be several tests or testing procedures that have been
developed for, or that seem to be at least possible for, his purpose. He must
select among these. He is interested in determining not only which is the
best procedure, but how well it satisfies his needs by some absolute standard.
There are many specific considerations in the evaluation of a test.
Validity is one of the most important of them. Validity refers to the extent to
which the test measures what we actually wish to measure. The role of the
criterion in the study of validity is evident from the following two situations.
Let us suppose we give a group of children a test of reading achievement.
The test requires the children to select certain answers to a series of
questions about reading passages and to make pencil marks on an answer
sheet. From the right answers we get each child's score on reading
comprehension. But the score itself is not the reading comprehension. It is
the record of a sample of behaviour. Any judgement regarding comprehension is
an inference based on the evidence provided by the number of allegedly correct
answers. Its validity is not self-evident but is something we must establish on
the basis of adequate evidence. The question of the criterion here is very much
evident. Again, let us consider the typical personality inventory that tries
to provide an appraisal of “emotional adjustment”, one of the
non-intellectual areas of CRT. In this type of inventory the respondent marks
a series of statements as being characteristic of him or not characteristic of
him. On the basis of his responses we get his score on “emotional
adjustment”. But making certain marks on a piece of paper is a number of
steps removed from actually exhibiting emotional disturbance. We must
find some way of establishing the extent to which the performance on this
test corresponds to the quality of behaviour in which we are directly interested.
Here also the importance of the criterion cannot be overemphasised.


The field of “criterion referenced measurement” is closely connected with
the different types of validity used in the field of quantitative psychology.
The nature of the criterion depends upon the nature of the
validity desired. There are two main types of evidence bearing on the validity
of a test, rational and empirical. CRT is more concerned with the former. On
the one hand, we encounter a wide range of testing situations in which
appraisal of the validity of a measurement procedure depends primarily upon
rational analysis and professional judgement. The analysis may be of the topics
and areas included in the test, its content. The rational analysis may also be
of the activities and processes that correspond to a particular concept (such
as ‘scientific method’), and we may speak of concept or construct validity. The
second main type of evidence is empirical and statistical. This type of
evidence comes from the relationship of the instrument that we are
studying to some other measure or fact. This other measure or fact may be
very closely similar to our test or it may be quite different. It may be obtained
at about the same time our test is given or it may not be available for a long
time in the future. Congruent validity refers to the evidence of validity obtained
by correlating a test with an existing similar measure of the same function.
Thus correlating an abstract-reasoning test with already existing tests
would provide evidence on congruent validity. Concurrent validity refers
to evidence of validity obtained by relating the test to some other measure
obtained at the same time. If a test devised to appraise sociability were
correlated with ratings on sociability by close friends, this would provide
evidence on the concurrent validity of the test. Predictive validity refers to
the validity of a test or other measuring instrument when it is related to some
criterion of performance or success that becomes available in future and is
quite different from the test itself. Thus when a scientific aptitude test given
to high school seniors is correlated with college freshman grades, evidence
is being obtained on its predictive validity. All these types of validity are
effectively applied in the CRT so far as competency-oriented evaluation is
concerned.
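
The empirical kinds of validity described above all rest on correlating test scores with some other measure. A minimal Python sketch of that computation follows, using the product-moment (Pearson) coefficient as one common choice. The function name and the scores are invented for illustration and do not come from any study reported here.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: aptitude test scores and later first-year grades
# for the same six students.
aptitude = [42, 55, 61, 48, 70, 66]
grades = [51, 58, 67, 50, 74, 65]
validity_coefficient = pearson_r(aptitude, grades)
print(round(validity_coefficient, 2))
```

Whether the resulting coefficient is read as congruent, concurrent or predictive evidence depends only on what the second measure is and when it was obtained, not on the arithmetic.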

In many situations the analysis of the crucial concept is the key to preparing
a valid test or to appraising the validity of one that has already been
constructed. We encounter such concepts as ‘scientific thinking’,
‘fairmindedness’, ‘rigidity’ or ‘reading comprehension’. Before we can make
progress with the task of measurement, we must analyse these global and
often fuzzy concepts into their behavioural components. It is against this
analysis that we must check our test to judge whether it has construct
validity. These are of utmost importance in the CRT when we focus our
attention on the non-intellectual part of evaluation.

Let us now focus our attention on the problem of the criterion. We have said
that predictive validity can be estimated by determining the correlation
between test scores and a suitable criterion measure of success. The point
here is the phrase ‘suitable criterion measure’. One of the most difficult
problems that the psychologist or educator faces is that of locating or creating
a satisfactory measure of success to serve as a criterion measure for
test validation. It may seem that this measure, once decided upon, should
be obtainable in an easy and straightforward fashion. Unfortunately, this is
not so. Finding or developing acceptable criterion measures usually involves
the research worker in the field of tests and measurements in a number of
troublesome problems.

Difficulties in obtaining satisfactory criterion measures arise from a variety
of sources. There are many areas that yield no objective record of
performance or production, as, for example, that of 'initiative', for which we
might be interested in using our test of effectiveness of expression. But even
when such records are available, they are often influenced by a variety of
factors outside the student's control. There are always many criterion
measures that might be obtained and used for validating a test. In addition
to quantitative performance records and subjective ratings, one might use later
tests of proficiency. All criterion measures are only partial in that they measure
only a part of success. This is true, for example, when tests of geometric
reasoning are validated against success in technical school. Such success
represents a relatively immediate but quite partial criterion of success as an
engineer. The ultimate criterion is some appraisal of the man's lifetime success
in his profession. In the very nature of things, such an ultimate criterion is
inaccessible to us and we must be satisfied with substitutes for it. These
substitutes are only partial and are never completely satisfactory. Our
problem is always to choose the most satisfactory from among the
measures that it appears feasible to obtain. These points are to be carefully
considered in the field of CRT. There are four qualities that we shall desire in
a criterion measure. In order of their importance they are:

(1) Relevance 

(2) Freedom from bias 

(3) Reliability and 

(4) Availability. 

We judge a criterion to be relevant in so far as the score on the criterion
measure is determined by the same factors that determine success in the
area. In appraising the relevance of a criterion we are thrown back once
more upon rational considerations, especially when the tests are developed
on different domains, which are to be validated against different competencies.
A second important quality in a criterion measure is freedom from bias: the
measure should provide each person with the same opportunity to make a good
score. As for reliability as it applies to criterion scores, the problem is
simply this: a measure of success in the particular area must be stable or
reproducible if it is to be predicted by any type of test device. Finally, in
the choice of a criterion measure one always encounters practical problems of
convenience and availability. They deal with such questions as: how long is it
going to take to get a criterion score for each individual? How much is it
going to cost? These questions are to be carefully considered in the "Unit
Approach" of measurement. A unit of study may be understood as a block of
closely related subject matter which can be conveniently overviewed by the
learner within a short span of time. Unit tests can also be developed on
non-cognitive variables, such as psychomotor and affective ones. The
availability of a quick and accurate criterion in these cases is of utmost
importance.

Let us now discuss some of the important points which are to be carefully
considered in the area of criterion referenced measurement. No other technique
and no other body of theory in psychology has been so fully rationalised from
the mathematical point of view as the psychological test. There has been
considerable interest in recent years in personal variability in measured
ability. It is sometimes pointed out that we should know not only the
examinee's characteristic level on a scale of ability but also his degree of
consistency in performing near that level. It is possible that individuals
differ systematically from one another in their consistency as well as in their
level of performance. If a certain examinee is quite consistent, his level of
performance will be quite predictable. If another examinee is markedly
inconsistent or variable about his mean, he is to that extent unpredictable in
his ability. Usual test practices seem to operate on the assumption that all
examinees are equally predictable. This point is to be given due consideration
in any criterion referenced measurement. In the case of domain-referenced tests
it is observed that the performance of a particular student is not uniform
across all the domains. Another aspect which deserves consideration is the
estimation of true variance. It is possible to obtain an estimate of the extent
of the true variance in a set of scores, if the co-efficient of reliability and
the total variance are known, from the equation
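$\sigma_{\infty}^{2} = \sigma_{t}^{2}\, r_{tt}$

where $\sigma_{\infty}^{2}$ is the true variance, $\sigma_{t}^{2}$ the total variance and $r_{tt}$ the co-efficient of reliability.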
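The relation can be illustrated with a short sketch. It assumes the classical reading of the equation, in which true variance is the product of total variance and the reliability co-efficient and the remainder is error variance. The function names and the figures are illustrative only.

```python
def true_variance(total_variance, reliability):
    """Estimate true variance as total variance times the reliability co-efficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return total_variance * reliability

def error_variance(total_variance, reliability):
    """Error variance is what remains of the total variance after true variance."""
    return total_variance - true_variance(total_variance, reliability)

# Hypothetical test: total score variance 100, reliability 0.81.
print(round(true_variance(100.0, 0.81), 2))
print(round(error_variance(100.0, 0.81), 2))
```

With these figures about four-fifths of the observed variance would be attributed to true differences among examinees and about one-fifth to errors of measurement.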

The validity of a composite of item scores, like any composite used to
predict a criterion measure, depends upon both the correlations of the items with
the criterion and the item intercorrelations. The most apparent principle is that
the greater the item-criterion correlations and the lower the item
intercorrelations, the greater the validity of the total score. The optimal
validity of a total score would be attained with a different weighting for each
item, in accordance with multiple-correlation principles. Humphreys has shown
that with items of uniform level of difficulty, the correlation of total score
with criterion is estimated by the equation

$r_{tc} = \dfrac{\bar{r}_{ic}}{\bar{r}_{it}}$

where $r_{tc}$ = correlation between test score and criterion
$\bar{r}_{ic}$ = average correlation between item and criterion
$\bar{r}_{it}$ = average correlation between item and total score.

Thus the validity coefficient, under these conditions, equals the ratio of
the mean item-criterion correlation to the mean item-total correlation. For high
validity, $\bar{r}_{ic}$ should be relatively large and $\bar{r}_{it}$ relatively
small. Thus when a test is used alone to predict a criterion it has a better
chance of being valid when it is of low internal consistency. A major problem in
any research study arises in choosing a satisfactory criterion for evaluation of
a hypothesis. The most frequently reported measure of the validity of a test in
a given situation is the co-efficient of correlation between test scores and
measures of a selected criterion. Unless the criterion measures are true scores,
such a correlation is more appropriately referred to as a measure of predictive
effectiveness than as a co-efficient of validity.
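The ratio described above, mean item-criterion correlation divided by mean item-total correlation, can be sketched as follows. The function name and the item correlations are hypothetical, and the sketch simply takes the ratio at face value under the stated condition of items of uniform difficulty.

```python
def estimated_validity(item_criterion_rs, item_total_rs):
    """Ratio of the mean item-criterion correlation to the mean item-total correlation."""
    mean_ic = sum(item_criterion_rs) / len(item_criterion_rs)
    mean_it = sum(item_total_rs) / len(item_total_rs)
    return mean_ic / mean_it

# Hypothetical three-item test with the same item-criterion correlations
# but two different levels of internal consistency.
item_criterion = [0.20, 0.25, 0.15]
high_consistency = [0.40, 0.50, 0.30]
low_consistency = [0.30, 0.35, 0.25]

print(round(estimated_validity(item_criterion, high_consistency), 2))
print(round(estimated_validity(item_criterion, low_consistency), 2))
```

The second figure is the larger one, which is the point made in the text: for fixed item-criterion correlations, lower item-total correlations go with a higher estimated validity of the total score.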


One of the important aspects in the sphere of criterion referenced
measurement is the correction for attenuation. Inter-correlations of tests, and
of tests with criteria, are restricted in size because of the amount of error
variance in each, where error variances are uncorrelated. When two
fallible measures are correlated, the errors of measurement, if uncorrelated
among themselves, always serve to lower the co-efficient of correlation as
compared with what it would have been had the two measures been perfectly
reliable. We say that the degree of correlation has been attenuated. If we
want to know what the correlation would have been if the two variables were
perfectly measured, we must resort to the correction for attenuation, for which
the formula is

$r_{\infty\omega} = \dfrac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$

where $r_{xy}$ is the obtained correlation and $r_{xx}$ and $r_{yy}$ are the
reliability co-efficients of the two measures.

But one thing is to be kept in mind: in predicting criterion measures
from test scores, one should not make a complete correction for
attenuation. Correction should be made in the criterion only, and in that
case the formula for a one-way correction is

$r_{x\omega} = \dfrac{r_{xy}}{\sqrt{r_{yy}}}$

where $r_{x\omega}$ is the correlation with correction in Y only and $r_{yy}$ is
the reliability of the criterion measure.
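The two corrections can be compared in a small sketch. It assumes the usual forms of the correction for attenuation: the obtained correlation divided by the square root of the product of both reliabilities for the complete correction, and by the square root of the criterion reliability alone for the one-way correction. Function names and figures are hypothetical.

```python
from math import sqrt

def correct_both(r_xy, r_xx, r_yy):
    """Complete correction: both test (X) and criterion (Y) treated as perfectly reliable."""
    return r_xy / sqrt(r_xx * r_yy)

def correct_criterion_only(r_xy, r_yy):
    """One-way correction: only the criterion (Y) is treated as perfectly reliable."""
    return r_xy / sqrt(r_yy)

# Hypothetical values: obtained validity 0.42, test reliability 0.81,
# criterion reliability 0.64.
print(round(correct_both(0.42, 0.81, 0.64), 3))
print(round(correct_criterion_only(0.42, 0.64), 3))
```

The one-way figure is the more modest of the two. It is the appropriate one for prediction, since the test will always be used in its fallible form while the unreliability of the criterion is not the fault of the test.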


Factor analysis of criterion measures is a profitable approach. Knowledge
of the reliability of criterion measures is minimum information. Knowledge
of their factor loadings provides considerable additional information. There
are several operations that this makes possible. When there are several
criterion measures to be considered for use, such information tells us which
ones are likely to be more relevant, whether or not they should be combined,
and, if combined, how they should be weighted. It also tells us what tests should
be included in a battery to predict the criteria and enables us to predict the
validity of a test in advance. It furthermore tells us where a battery is weak,
that is, where an important factor in the criterion is represented
insufficiently by the battery. If
the communality of the criterion is definitely lower than its reliability, we
have a challenge to identify the "specific" variance as possibly due to other
common factors not yet taken into account.


Problems of multiple prediction arise because the scores from which
predictions are made may come from a single test or from several, and the
criterion measures predicted may come from a single domain or from several
domains. The usual situation is that of multiple predictors and a single
criterion measure. In this case, the single criterion measure may be a composite
score. Whether the criterion is singly derived or a composite, the usual
procedure is to derive a multiple regression equation, with weights that will
maximise the correlation between predicted and obtained criterion measures.
But the multiple regression approach has its limitations, since
these methods rest on the assumption of linear regressions among the
measures going into the equation. As Mosier points out, when the entire
range of ability is taken into account, the regressions of a criterion on scores
of ability are probably quite commonly non-linear. Above a minimal level the
regression may be horizontal and for upper levels of I.Q. might even decline.
In many researches the existence of a threshold I.Q. in the sphere of
academic achievement has been observed. Some studies dealing with
Anderson's Ability Gradient theory have located a threshold I.Q.
beyond which I.Q. has no effect on academic achievement (the criterion), but
creativity begins to affect achievement. This non-linearity in regression
is clearly observed while dealing with personality variables. In many
situations dealing with criterion referenced measures this fact has not been duly
considered, yielding distorted results. In the case of interest and
temperament scores, Guilford has pointed out the actual and potential
existence of many curvilinear regressions. An inverted U-shaped relationship
has been observed between neuroticism and performance. Correlational
analysis, which is fundamental to any criterion referenced measurement, is
based on the assumption that the relationship being investigated is linear and
homoscedastic. This implies that the relationship will be constant at all values
of both variables. While investigations are limited to cognitive variables,
such as school attainment and verbal reasoning ability, these assumptions
can generally be justified. But in many recent studies reporting correlational
analysis between school achievement and non-intellectual variables such as
social background or personality, the assumptions of linearity and
homoscedasticity do not hold [Entwistle (1968), Eysenck & Cookson (1970),
Elliott (1972), Abbot (1974), Orpen (1976)]. At any rate, it poses a problem
to the investigator who employs multiple regression procedures to assure
himself that the regressions are linear. Even if some regressions are
non-linear, it is possible by transformations to reduce these to rectilinear form
and then use multiple regression methods.

Often the curvature in regression is so slight that we do not know whether
it is merely a chance deviation from linearity. We therefore want some
statistical test to show whether or not the curvature is probably real. Probably
the most dependable one is that suggested by Fisher, whose formula is based
on a chi-square test. Another point which requires due weightage in
criterion referenced measurement is the linear restraint. It is a common
experience to find that after three or four of the most valid tests have been
combined in our equation to predict a criterion, adding more tests rarely
improves prediction. The outcome just
described is likely to happen when the intercorrelations of predictors are
substantial. One way of reducing the number of restraints is to combine tests
that intercorrelate highly with one another, and let them enter the battery as
one variable. When the criterion consists of several different measures,
complications in prediction arise. One of the ways to solve this problem
statistically is by deriving for the tests combining weights that would predict
best the most predictable weighted-composite criterion.
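The effect of predictor intercorrelation on multiple prediction can be sketched for the simplest case of two tests. The sketch uses the standard expression for the squared multiple correlation with two predictors, R² = (r1c² + r2c² − 2·r1c·r2c·r12) / (1 − r12²). The function name and the correlations are hypothetical.

```python
from math import sqrt

def multiple_r_two_predictors(r1c, r2c, r12):
    """Multiple correlation of a criterion with two predictors.

    r1c, r2c: correlation of each test with the criterion.
    r12: correlation between the two tests (must be strictly between -1 and 1).
    """
    if abs(r12) >= 1.0:
        raise ValueError("r12 must be strictly between -1 and 1")
    r_squared = (r1c ** 2 + r2c ** 2 - 2 * r1c * r2c * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)

# Two tests, each with validity 0.50, at two levels of intercorrelation.
print(round(multiple_r_two_predictors(0.5, 0.5, 0.0), 3))  # uncorrelated tests
print(round(multiple_r_two_predictors(0.5, 0.5, 0.8), 3))  # highly correlated tests
```

With uncorrelated tests the second test raises the multiple correlation well above 0.50; with an intercorrelation of 0.8 the gain is slight, which is the experience described in the text.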


Let us now focus our attention on creativity tests. During the last few years
there has been a greatly increased interest in the cognitive aspect of
‘creativity’, which stems to a large extent from the work of J.P. Guilford and
his associates. In particular, his distinction between convergent and divergent
thinking and the construction of new kinds of tests to measure the latter have
led many psychologists to believe more strongly that these important kinds of
ability have not been adequately assessed by conventional intelligence and other
ability tests. The problem of the criterion is very important to creativity
researchers. There is always some doubt whether some of the criteria that now
exist are truly measuring creativity. We are getting many creative and
non-creative criterion dimensions. These results make us keenly aware that, as
each new criterion dimension is isolated, a new and important psychological
phenomenon is revealed for exploration. In some studies the
relationships of biographical information with different indices of
creativity have been studied. In one interesting study on Air Force
scientists, 56 criterion measures from 8 different sources were boiled down
to 14 criterion factors. A striking finding in this criterion study was that no
single criterion measured more than 4 of the 14 criterion dimensions, and most
criteria spanned from 1 to 3 of the total criterion dimensions. We, therefore,
conclude that when merely a single criterion measure from only one source of
information is used, there is a good chance that many performances and
contributions on the job are being missed. These results stress the importance
of selecting a suitable criterion in the development of different creativity
tests. Creative ability is present in varying amounts among students. It is one
of the major tasks of the teacher to foster the creative development of the
students through curricular and co-curricular activities. In
CRT the teacher should frame questions in different domains in which the
students can express their creative abilities. As mentioned earlier, the
development of suitable criteria for this purpose is one of the vital issues.
Thorough research is urgently needed, as we cannot neglect these
creative abilities in our students.
Criterion Referenced Testing In


Mastery Learning 
M.A. Khader 


ABSTRACT 


The model of mastery learning as proposed by Carroll and
Bloom emphasises the need to evaluate performance with regard to a
pre-determined criterion and not only in comparison with the performance
of others. The advantage that criterion referenced testing has in the
mastery learning model is emphasised. The qualities that such
tests must possess to function well as progress diagnostic tests are
analysed and the steps in the development of such tests are
enumerated. The need to prepare short tests based on basic
competencies is underscored.


1. MASTERY LEARNING MODEL 

Though the idea that most students can learn what the schools have to
teach, if they are taught systematically, is a very old one, in mastery learning
the notion becomes the central theme: what matters is mastery of specific
learning tasks rather than the time spent to attain it. The underlying rationale
is that most students can attain mastery of the subject if proper guidance is
provided wherever they encounter difficulties, if sufficient time is provided to
achieve mastery and if there is a specific criterion of what constitutes mastery
(Bloom, 1964).


The theory of mastery learning is anchored on the model of school learning
suggested by Carroll (1963). Carroll's model states that if students are
normally distributed on the basis of their aptitude for a given subject and
all are given exactly the same instruction (time, amount and quality of
instruction), then achievement measured at the completion of the subject will
be normally distributed. Under such conditions the correlation between
aptitude measured at the beginning of the instruction and achievement
measured at the end of the instruction will be relatively high. On the other
hand, if students are normally distributed with respect to aptitude but the
quality of instruction and time allowed are congruent with the needs of each
learner, the majority of students will achieve mastery of the subject. In such
an instance, the correlation between aptitude and achievement will be around
zero.

Bloom (1968) agrees with Carroll's model and suggests that the degree
of learning required should be fixed at some mastery level. In such a
learning system manipulation of instructional variables is the primary concern,
so that all students achieve mastery. In fact, mastery learning programmes
are individualized in nature and composed of units or modules,
hierarchically structured, based on instructional objectives. Each unit should
be analysed in terms of constituent elements, ranging from terms or facts to
complex ideas such as concepts and principles (Bloom, 1956). It may even include
application of principles and analysis of complex theoretical statements. Each
learner is required to work on the unit until he has achieved a specified
level of achievement. The learner is considered to have mastered the
subject if, and only if, he has attained that level. In such a case the learner's
performance is assessed in terms of what he/she knows and not in relation to
the performance of others. If he has attained the specified level, then the
decision is to move on to the next unit. If he fails, then he is required to study
the material again until he adequately masters it. According to Bloom
(1970) an effective mastery learning strategy must attend to essential
conditions such as the learner's aptitude for particular kinds of learning,
quality of instruction, perseverance and time allowed for learning.


2. CONCEPT OF CRITERION-REFERENCED TESTING 

How does one know whether a learner has attained the mastery level
on a given unit? Such a question calls for a technique of testing in mastery
learning. Traditionally, criterion-referenced testing has been advocated as
the most effective method of testing in the mastery learning context. What
does criterion-referenced testing mean? Criterion-referenced testing
compares the level of a student's performance against an identified
standard or criterion. It identifies what a learner can do, knows, has
attained or is competent in. On an arithmetic test, for instance, Ramu can
correctly solve 80 per cent of the problems. In this case Ramu's position with
respect to others is irrelevant; rather, Ramu's absolute status in relation
to the knowledge of arithmetic is the only concern. The focus is on whether or
not an individual is able to perform at an acceptable standard. Bloom (1971),
Popham (1975) and Glass (1978) suggest the need for a 'standard' or 'cut-off
score' to indicate the mastery of a task. For instance, the standard for
good performance may be getting at least 80 per cent of the test items correct on
the criterion-referenced test.
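A mastery decision of this kind reduces to comparing a learner's percentage of correct items with the cut-off score. The following minimal sketch assumes a cut-off of 80 per cent, as in the example above; the function names and the response pattern are hypothetical.

```python
def percent_correct(responses):
    """Percentage of items answered correctly; responses is a list of True/False."""
    return 100 * sum(responses) / len(responses)

def has_mastered(responses, cutoff=80):
    """Mastery is decided against the fixed cut-off, not against other learners."""
    return percent_correct(responses) >= cutoff

# Hypothetical record: Ramu answers 8 of 10 arithmetic problems correctly.
ramu = [True, True, True, False, True, True, True, True, False, True]
print(percent_correct(ramu), has_mastered(ramu))
```

The scores of other pupils never enter the computation, which is what distinguishes this interpretation from a norm-referenced one.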

It is natural, then, to want to know whether a learner has mastered the
knowledge or skills necessary to advance to the next level in a learning
sequence. Criterion referenced testing provides information about the specific
knowledge and abilities of learners through their performances, in terms of what
they know or can do, without reference to the performance of others
(Brown, 1981).

Criterion referenced tests are of two types: objective-referenced and
domain-referenced tests. The objective-referenced test is seen as inadequate,
since behavioural objectives often lack sufficient clarity for the determination
of the domain of test items measuring the behaviours intended to be defined
by an objective. What is the possible alternative in such a situation? The
amplified objective appears to be one: it provides boundary specifications
regarding testing situations, response alternatives and criteria of content
to which the objectives relate, as is required so that a well-defined set of
performance tasks, called a domain, can be specified. A domain can be
conceptualized as a set of hierarchically arranged elements. In elementary
mathematics, for example, the capability for solving multiplication
problems might be seen as more readily accomplished if the simpler capabilities,
number concept and addition problems, have been learned as prerequisites. In
mastery learning, when items are formulated in relation to a given criterion, the
test can be used to ascertain an individual's status with respect to a
well-defined behaviour domain. The status is determined from a domain score, the
score the learner would achieve if all the items in the domain were administered.

In mastery learning, the criterion-referenced test can be used as a
diagnostic-progress test to determine which learner has or has not mastered a
unit and what he must do to complete his learning unit. If the
criterion-referenced test is to be used for diagnostic purposes, then, Davis
and Diamond (1974) argue, the items must be homogeneous. There exist two types
of homogeneity: conceptual homogeneity and response homogeneity. The
former is reflected in the extent to which all the items making up the test are
congruent with the domain specification, whereas response homogeneity
implies that, given the particular set of items, a learner would be expected to
get them either all ‘right’ or all ‘wrong’.


3. CONSTRUCTION OF CRITERION-REFERENCED TEST 
The first task is to formulate the objectives and select the content area
that the test purports to cover. Objectives and content should be indicated with
clarity so that different individuals will identify the same item pool
corresponding to the test specifications. Gronlund (1973) suggests
consideration of the following questions while formulating mastery objectives
for a given course.
— What minimum knowledge and skills are prerequisites to further
learning in the same area?
— What basic skills are prerequisites to learning in other areas?
— What minimum skill is needed for safe performance in some particular
activity?
— What minimum knowledge and skills are needed to function in everyday
life?
Once the mastery objectives have been identified and limits set to the
content, the teacher is in a position to state his or her general instructional
objectives. Each general statement could then be broken down into a set of
behavioural objectives or specific learning tasks on which learners are to
demonstrate attainment of mastery at the end of the learning experience to
show that they have achieved the instructional objectives. Specifically, an
attempt should be made to select those learning tasks which represent the
attainment of the objective. Nevertheless, attention must be given to the
ability and past learning experience of the pupils to ensure that the
learning tasks are appropriate.


4. SCHEME FOR ITEM FORMULATION 


Though different schemes exist, facet analysis is seen as more
appropriate. Facets are viewed as the characteristics on which the items that
make up the domain differ from each other, and facets are often linked
semantically by a mapping sentence common to all items (Millman, 1974a).
Once the knowledge or skill to be assessed is determined, the test developer
should choose facets, and elements within those facets, that are believed to
maximise item variance. This implies that, whatever the test is intended to
measure, the test writer should identify those characteristics (facets) whose
manipulation may lead to the greatest variation among learners' responses to
the items.


5. PREPARATION AND ANALYSIS OF ITEMS 

Though the general rules followed for writing test items remain valid for
criterion-referenced tests, construction or selection of an item depends on the
degree of its relationship to the specified domain.

Specifically, the quality of items depends on the degree to which they reflect
the domain from which they are derived or adhere to the restrictions
imposed by the domain definition. In fact, the writer should construct
items which are able to distinguish between those learners who have
attained mastery and those who have not (Gronlund, 1973). The ideal test
item will enable the knowledgeable student, and only the knowledgeable
student, to answer correctly.

Once the items have been written, their adequacy and quality may be
assessed by judgements made by content specialists. Independent
item reviewers then evaluate the congruence between each item and its
corresponding objective or domain. Brown's (1981) suggestions on
techniques for collecting and analysing the judgements of content specialists
are important in this context.


5.1 Content specialists judge each item on a three-point system, from
definitely a measure of an objective to definitely not. An index of
Item-Objective Congruence is computed for each item across all the
judgements, and valid items are distinguished from non-valid ones by
applying a cutting score based on experience with content specialists'
ratings and with the index itself.

5.2 Specialists rate each item as a measure of the intended objectives
on a seven-point scale. The mean rating across specialists is an indication of
the validity of the item, and the standard deviation of the ratings assesses the
extent of agreement among the specialists.


5.3 Specialists match each one of a list of items to objectives on another
list. A contingency table showing the number of content specialists
matching each item to each objective will reveal the extent of
agreement and identify disagreements.
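The second and third techniques lend themselves to a brief sketch. It assumes that ratings are recorded as whole numbers from 1 to 7 and that each matching judgement is recorded as an (item, objective) pair. The function names, the choice of the population standard deviation and the data are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def summarise_ratings(ratings):
    """Mean (validity indication) and standard deviation (agreement) of one item's ratings."""
    return mean(ratings), pstdev(ratings)

def matching_table(judgements):
    """Count how many specialists matched each item to each objective."""
    return Counter(judgements)

# Hypothetical seven-point ratings of one item by five specialists.
avg, spread = summarise_ratings([6, 7, 6, 7, 6])
print(round(avg, 2), round(spread, 2))

# Hypothetical matching judgements of three specialists for the same item.
table = matching_table([
    ("item 1", "objective A"),
    ("item 1", "objective A"),
    ("item 1", "objective B"),
])
print(table[("item 1", "objective A")], table[("item 1", "objective B")])
```

A high mean with a small standard deviation suggests that the specialists agree the item measures its objective; a split count in the table flags an item whose objective is in dispute.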


In mastery learning, criterion-referenced assessment becomes an
integral part of instruction, and it is crucial to identify high quality items
that are relevant to the objectives. However, a long list of items would be
welcomed by neither the teacher nor the student, while assessments based on a
small number of items are highly susceptible to error and may lead to a low
level of consistency of scores. In such a case a moderate number of items which
reflect the domain can be used. If the learner responds correctly, he/she is
assumed to have the necessary mastery and moves on to the next higher level.
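The trade-off between test length and error can be made concrete with a rough sketch. It assumes, purely for illustration, that the items are an independent random sample from the domain, so that the standard error of a proportion-correct score is the binomial value sqrt(p(1 − p)/n). The function name and the figures are hypothetical.

```python
from math import sqrt

def standard_error_of_proportion(p, n_items):
    """Binomial standard error of a proportion-correct score on n_items items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return sqrt(p * (1 - p) / n_items)

# A learner whose domain score is 0.80, tested with 5, 20 and 80 items.
for n_items in (5, 20, 80):
    print(n_items, round(standard_error_of_proportion(0.8, n_items), 3))
```

Under this assumption the error is halved only when the number of items is quadrupled, which is why a moderate test length is a reasonable compromise between accuracy and the burden on teacher and student.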


Norm-Referenced Measurement 
Versus Criterion Referenced 
Measurement 


N. Vasantha Ram Kumar 
K.N. Lalithamal 


ABSTRACT 


Criterion referenced tests have a clear advantage over norm-referenced
tests, for they can be used both as achievement and as
diagnostic tests. Unlike those of norm-referenced measurement devices, the
characteristics of criterion referenced test quality are not group-based
and hence have to be interpreted keeping in view the
attainability of the criterion level.

Measurement plays a very important role in the teaching-learning
process. In order to judge the attainment of pupils accurately and fairly, a
teacher must have accurate measurement devices at his disposal and must
know how to use them. One of the instruments used for measurement is
the test. Achievement tests are administered frequently for the purpose of
assessing the performance of pupils. In these tests intended outcomes are
seldom specified in terms of expected performance prior to test construction,
and hence a score is meaningful only in comparison with the scores of others
taking the same test.


NORM-REFERENCED TESTS 

These tests are designed to rank students in order from high to low
so that decisions based on relative achievement can be made with greater
accuracy. Only items that provide a wide range of scores are selected for
these tests. One of the most common techniques for improving the quality
of these tests is to compute an item discrimination index for each item in the
test and select items having a satisfactory discrimination index.
Items that all pupils are likely to answer correctly are not included in these
tests. The performance of an individual is interpreted according to the
performance of other individuals on the same measuring device. This type of
interpretation enables us to determine how an individual's performance
compares with that of others.


USES OF NORM-REFERENCED MEASUREMENT 
Norm Referenced Measurement is useful:
(1) in aptitude testing for making differential prediction;

(2) to get a reliable rank ordering of the pupils with respect to the
achievement we are measuring;

(3) to identify the pupils who have mastered the essentials of the course 
more than others; 

(4) to select the best of the applicants for a particular programme; 


(5) to find out how effective a programme is in comparison to other possible 
programmes. 


DRAWBACKS 

Some of the criticisms raised against these tests are:

1. Test items that are answered correctly by most of the pupils are not
included in these tests because of their inadequate contribution to
response variance. Yet these are often the items that deal with important
concepts of course content.

2. There is lack of congruence between what the test measures and what
is stressed in a local curriculum.

3. Norm-referencing promotes unhealthy competition and is injurious
to the self-concepts of low scoring students.

The measurement of achievement of learners has been the major
concern of the evaluators of the past. But today's evaluators are more
concerned about improvement of students’ learning and diagnosis of
students’ weaknesses and inadequacies in instructional strategies than
mere assessment of their achievement. This led to the development
of a new type of measurement called criterion-referenced measurement.
Realising the inappropriateness of applying traditional measurement notions
to situations involving modern instructional techniques, Robert Glaser wrote
an article on “Instructional Technology and the Measurement of Learning
Outcomes: Some Questions” which catalyzed interest in this measurement
issue.


CRITERION-REFERENCED MEASUREMENT 


“A criterion-referenced test is used to ascertain an individual's status with
respect to a well-defined behaviour domain” (5:130). “It is a test that is
deliberately constructed to yield measurements that are directly interpretable
in terms of specified performance standards” (6:2).

The word criterion in criterion-referenced tests denotes “an instructional
objective, an expected post-instructional learning outcome, an intended level
of a student's performance, an acceptable level of learner's achievement
or a desired standard of product of performance” (6:1).

The focus of criterion-referenced measurement is on diagnosing students’ 
inadequacies in learning and improvement of instructional strategies. 
Therefore different domains are defined in terms of the criterion regarded as
intended learning. Then test items are prepared. As in norm-referenced tests,
item analysis is an essential technique for judging the quality of
criterion-referenced tests. The facility index and discrimination index of
items are computed for selecting items for a criterion-referenced test. The
facility index or difficulty level in a criterion-referenced test can go up to
100 per cent. Expressed as a proportion, the facility index should normally
vary from 0.80 to 1 in criterion-referenced tests.

In criterion-referenced tests the items are passed by a high proportion
of students and as such the discrimination expected is much lower than that
of discriminatory tests. Minimisation of the gap between masters and
non-masters is the focus of criterion-referenced tests and as such a high
discrimination index cannot be expected. The discrimination index is
calculated by using the formula


$$D = \frac{B_p}{n_p} - \frac{B_f}{n_f}$$ (6:21) where


$B_p$ = Number of examinees who passed the total test and answered the
item correctly.

$B_f$ = Number of examinees who failed the total test and answered the item
correctly.

$n_p$ = Number of examinees who passed the total test.

$n_f$ = Number of examinees who failed the total test.
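
The two item statistics described above can be illustrated with a short computation. The following is only a sketch, assuming that the discrimination index is the proportion of examinees passing the total test who answer the item correctly minus the corresponding proportion among those who fail it, which is how the four quantities defined above appear to combine. The function names and the response data are hypothetical.

```python
def facility_index(item_responses):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return sum(item_responses) / len(item_responses)


def discrimination_index(item_responses, passed_test):
    """Proportion correct among passers minus proportion correct among failers."""
    passers = [r for r, p in zip(item_responses, passed_test) if p]
    failers = [r for r, p in zip(item_responses, passed_test) if not p]
    if not passers or not failers:
        raise ValueError("need at least one passer and one failer")
    return sum(passers) / len(passers) - sum(failers) / len(failers)


# Hypothetical data for one item: 1 = answered correctly, 0 = incorrectly.
item = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
# Whether each of the same ten examinees passed the total test.
passed = [True, True, True, True, True, True, True, True, False, False]

print(facility_index(item))                # 8 of 10 correct
print(discrimination_index(item, passed))  # 7/8 among passers minus 1/2 among failers
```

With these invented figures the item is easy (facility index 0.8) and its discrimination index is modest, which is the pattern the text leads one to expect of criterion-referenced items.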


Criterion-referenced tests are interpreted concept-wise, domain-wise,
objective-wise and student-wise. Concept-wise analysis is to find out the
level at which a particular concept has been learnt. Domain-wise analysis can
reveal the appropriateness of domain description and definition.
Objective-wise analysis helps to find out the level of attainment of various
objectives. Student-wise analysis helps to identify the masters and
non-masters of a particular course content.
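
The kinds of analysis just listed amount to grouping the same item responses in different ways. A minimal sketch follows; the item map, the responses and the 75 per cent mastery cut-off are all hypothetical and are not taken from the text.

```python
from collections import defaultdict

# Hypothetical item map: item id -> (concept, objective)
ITEMS = {
    "q1": ("fractions", "knowledge"),
    "q2": ("fractions", "application"),
    "q3": ("decimals", "knowledge"),
    "q4": ("decimals", "application"),
}

# Hypothetical responses: student -> {item id: 1 if correct else 0}
RESPONSES = {
    "A": {"q1": 1, "q2": 1, "q3": 1, "q4": 1},
    "B": {"q1": 1, "q2": 0, "q3": 1, "q4": 0},
    "C": {"q1": 1, "q2": 1, "q3": 0, "q4": 0},
}

MASTERY_CUTOFF = 0.75  # assumed cut-off, chosen only for illustration


def proportion_correct_by(key_index):
    """Proportion of correct responses grouped by concept (0) or objective (1)."""
    right, total = defaultdict(int), defaultdict(int)
    for answers in RESPONSES.values():
        for item_id, score in answers.items():
            group = ITEMS[item_id][key_index]
            right[group] += score
            total[group] += 1
    return {group: right[group] / total[group] for group in total}


def masters():
    """Students whose proportion correct reaches the assumed cut-off."""
    return sorted(
        student for student, answers in RESPONSES.items()
        if sum(answers.values()) / len(answers) >= MASTERY_CUTOFF
    )


print(proportion_correct_by(0))  # concept-wise analysis
print(proportion_correct_by(1))  # objective-wise analysis
print(masters())                 # student-wise analysis
```

The same response table serves all three readings; only the grouping key changes.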


COMPARING THE TWO MEASURES 

In norm-referenced tests a score of one person is interpreted by comparing 
his score to those of others while in criterion-referenced tests an 
individual's performance is interpreted by comparing it to some specified 
behavioural criterion of proficiency. 
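
The contrast can be made concrete by interpreting one set of marks in both ways. This is a sketch only: the marks, the function names and the 80 per cent proficiency standard are assumptions made for illustration.

```python
def norm_referenced_rank(scores, student):
    """Rank of a student within the group (1 = highest); tied scores share a rank."""
    own = scores[student]
    return 1 + sum(1 for s in scores.values() if s > own)


def criterion_referenced_status(score, max_score, criterion=0.8):
    """Compare one score with a fixed standard, ignoring the other pupils."""
    return "master" if score / max_score >= criterion else "non-master"


scores = {"A": 18, "B": 15, "C": 12, "D": 9}  # hypothetical marks out of 20

for name, mark in scores.items():
    print(name, norm_referenced_rank(scores, name),
          criterion_referenced_status(mark, 20))
```

Pupil B stands second in the group under the norm-referenced reading, yet falls short of the assumed standard under the criterion-referenced reading.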

Since criterion-referenced tests are not concerned with the relative
achievement of pupils, item difficulty and the power of items to discriminate
among pupils are not used as criteria in item selection. The items are
selected on the basis of how well they reflect the learning tasks being
measured. In norm-referenced tests, however, the difficulty index and
discrimination index of items are used as criteria for selecting items.

Both criterion-referenced and norm-referenced measurements can serve
the four basic functions of evaluation in the classroom. However, the functions
of formative and diagnostic evaluation are likely to be best served by
criterion-referenced measurement and those of summative evaluation by
instruments that are norm-referenced. Placement evaluation is likely to
require both criterion-referenced and norm-referenced evaluation.


USES OF CRITERION-REFERENCED MEASUREMENT 


This type of measurement is useful:

(a) to discover the inadequacies in pupils' learning and help the weaker 
section of students to reach the level of other students through a 
regular programme of remediation. 

(b) to identify the masters and non-masters in a class.

(c) to find out the level of attainment of various objectives of instruction. 

(d) to find out the level at which a particular concept has been learnt. 

(e) in better placement of concepts at different grade levels. 

(f) in individually prescribed instruction programme and mastery learning 
model of Bloom to make instructional decision of what to do with a 
pupil. 


Prospects Of Criterion Referenced
Tests In The Context Of The National 
Policy Of Education 


B.S. Raizada 


ABSTRACT 


The paper uses the funnel technique to discuss the prospects
of criterion-referenced tests. It starts with justifying the role of
evaluation in education, passes on to discuss the nature of two
competing evaluation measures, discusses the situations which
favour them most and finally discusses the prospects of
criterion-referenced tests in our present educational set-up. The
prospect is discussed within various measurement domains and in
relation to various emphases in our educational system, which has
such priorities as child-centred education, universalisation of primary
education and vocationalisation of education. The paper seeks to
establish a bright prospect for criterion-referenced tests in helping to
achieve the desired outcomes in these priority areas of our present
educational system, particularly in the context of the National Policy
of Education 1986, supported by sophisticated computer technology.


1. NEED FOR EVALUATION MEASURES IN EDUCATION 

Evaluation is an integral part of any education system. Education is 
modification of behaviour in a predetermined desired direction and educators 
or teachers are called upon to lead their students in their educational 
endeavours, towards this desired direction. However, while performing this 
sacred duty they are faced with a number of situations where they have to take 
crucial decisions. These situations may arise at various stages of students’ 
progress towards the cherished goals of education or may be the outcome of 
specific demands of a particular educational system. Various evaluation 
techniques help teachers in taking wise decisions in such crucial situations. 

Evaluation is conceptualised as the process of collecting relevant data
to provide information which helps in taking wise decisions in educational
situations. This process of collecting data is conducted with the help of
some kind of test measures. These tests provide teachers with more
objective information on which to base their decisions.

Various uses of testing have been summarised by Gronlund (1982) under 
two main heads. 

(i) For relative ranking of students.

(ii) For describing the learning tasks they can or cannot do.

Pertaining to these uses of tests we have two types of test measures:
norm-referenced tests and criterion-referenced tests. Test results of
norm-referenced measures are interpreted in terms of each student's relative
ranking among other students. For example, student A is fifth highest on a
test in a class of 45 students. In criterion-referenced measures, on the other
hand, the test results are interpreted in terms of the specific knowledge and
skills each student can demonstrate. For example, he can identify all parts
of a typing machine and demonstrate their proper use.

Underlying the concept of criterion-referenced measure is the notion of
a continuum of knowledge acquisition which extends from zero proficiency to
perfect proficiency. [...] performance is needed. The point is that specific
behaviours implied at each level of proficiency can be identified and used to
describe the specific tasks a student must be capable of performing before he
achieves one of those levels of proficiency. “Measures which assess student
achievement in terms of criterion standard thus provide information as to the
degree of competence attained by a particular student which is independent of
the reference to the performance of others” (Hubert A. Taylor 1979).

Again, in criterion-referenced testing an individual's performance is
evaluated in terms of an absolute or specific criterion that has been set for
him. For example, recognition of [...] of the words in a list of 50 words may
be the criterion set for one student whereas the recognition of all sight
words in a list of 10 words may be the criterion for another student
(Hallahan and Kauffman 1976).


3. EDUCATIONAL SITUATIONS WHICH FAVOUR CRITERION- 
REFERENCED TESTS 
Before assessing the prospects of criterion-referenced tests in the [...]

[...] the necessary skills and abilities to begin the instruction. Sampling
considerations are to include each prerequisite entry behaviour and hence items
included in the test are typically easy and criterion-referenced. However,
the intention is to determine entry performance [...]


concerned with appraising different means to achieve the stated goal and 
develop appropriate testing programme that may help in evaluating the 
achievement towards the stated goal. So it is necessary to be clear regarding
the demands of present day education to decide which type of testing 
programme (s) may be suited to a particular situation. 


Present-day education emphasises child-centred education. Child-centred
education was not only advocated by the great sociologist
Rousseau and thinkers like Pestalozzi, Froebel, Montessori, Dewey etc.
but was also necessitated by a number of pressing factors such as:


3.2.1. Advent of Democracies 


The strength and success of democracy is invariably linked with the
strength and success of the people constituting the democracy. Democracy
therefore emphasised the supreme worth of the individual and
consequently the imperative need for his best growth.


3.2.2. Need for Conservation of Man-power


Every conscious nation realises that not only its progress but its very
survival is inseparably linked with the effective use of its man-power. Today
the progress of a nation is judged not on the basis of its natural resources,
but on how well it has been able to muster its human resources to make their
most effective contribution to the nation's welfare and progress.
Child-centred education and better management of such education therefore
become more and more essential to tap the intellectual resources of the country
and divert them to the most effective channels of action, which not only bring
pleasure and satisfaction to the individual but also contribute towards
national prosperity and progress.


3.2.3. Changing Conditions of Human Survival 


Man, now being more civilised, is bound by many more social laws in his
action and behaviour towards others. To exist and ensure his survival he
cannot take liberties as he desires. The utility of sheer might in his survival
has been considerably reduced. Human survival in the present-day
competitive, industrially oriented civilisation has become distinctly more
of an intellectual sort. In general, the more successful men in life are those
who can make the best use of their mental faculties. The needs and problems of
education which may help an individual in better adjustment with his
environment are therefore becoming more and more complex. Education is
needed to make individuals more worldly wise.


3.2.4. Obvious Fact of Individual Differences 


In spite of all the challenges of education, human limitations cannot be
overlooked. There are varied differences in the potentialities and nature of
different individuals. The quality and quantity of learning capacity in
different individuals are not the same. Some can learn much more and much more
quickly than others. Similarly some can be much more efficient in some areas
of learning while being very poor or just average in some other areas. Every
child has therefore to be taken in his own right when the question of his
education arises. Education is for him, not he for education. Education
will be more meaningful and motivating if it is in tune with the potentialities
and nature of the child.

The above thrusts on education have resulted in three major foci in our
National Policy of Education.

(a) Universalisation of primary education: As the strength of a chain is the
strength of its weakest link, the strength of a democracy can be assessed
by the strength of its weakest section. If a big chunk of its population is
illiterate and unconscious of its potentialities and obligations it cannot
contribute its due share in the maintenance and progress of the social order
of which it is a part. Realising this basic fact the framers of our constitution
under Article 45 have made it a National obligation to provide for free and
compulsory education for all children in the age group of 6-14 years.

(b) Individualised instruction: This becomes necessary to draw out the
best in the child and provide him opportunities for the maximum development
of his potentialities, which not only help in his personal advancement but also
bring his best contribution to the progress and welfare of the country. Such
education also helps a child in his unique adjustment with his environment,
keeping in view the strengths and weaknesses of his personality.

(c) Vocationalisation of Education: The concept of democracy visualises
maximum welfare of maximum numbers. Vocationalisation of education is
intended to help the majority of students to become self-sufficient and to get
prepared for some vocation suitable to their aptitude.


4. PROSPECTS OF CRITERION-REFERENCED TESTS IN SUCH AN
EDUCATIONAL SYSTEM


4.1 Role of Criterion-referenced Tests in the Universalisation of Primary
Education

As we have seen, universalisation of primary education is our
constitutional obligation, which implies universal enrolment and universal
retention up to 14 years of age. The chief objectives of primary education
during this period are to develop certain basic personal and social skills in
children, such as:

4.1.1. Equipping them with effective means of communicating with their
environment through literacy (verbal and written communication),
numeracy (simple manipulation with numbers) and techniracy (scientific
method of enquiry).

4.1.2. Developing in them those basic attitudes, traits and habits which are
personally useful and socially desirable, such as personal cleanliness and
hygiene, respect for social norms and symbols, dignity of labour, aesthetic
sense and a cooperative way of living.

Thus the emphasis in primary education is to develop in a child basic skills
of observation, communication and adjustment with his environment and to
condition him to a certain way of life. The emphasis here is not so much on
grading the children on these basic skills as on helping them master them.

On the evaluation side it has been suggested that

(i) no rigid evaluation should be imposed;

(ii) evaluation should be integrated with the process of learning;

(iii) there should be continuous recording of the child's progress;

(iv) normally all children should be promoted; however, special attention
should be given to those who do not show adequate progress.

To sum up, the main demands on primary education are:

(i) to help the child master certain basic skills. 

(ii) continuous progressive evaluation to record his progress. 

Here learning tasks are not very varied and complex, which means they can
be easily and clearly delineated. Emphasis is on mastery of these tasks.
Finally, remedial instruction is suggested where necessary. All these
situations can be most appropriately taken care of by criterion-referenced
tests. Whatever we have said above about criterion-referenced tests makes them
highly appropriate for attaining the objectives of primary education. To find
out the extent to which students have mastered the learning outcomes
formative testing may be used, and where necessary diagnostic testing
may be used to decide upon the nature of remedial teaching work. In all
these cases criterion-referenced tests will be best suited to assess an
individual student's exact position on the achievement continuum and to
diagnose specific areas of his learning difficulty (if any) so that he is
helped to progress smoothly towards the desired goal. “Among the many
advantages of criterion-referenced tests, flexibility in using this type of
test for various individual requirements and continuous assessment for noting
individual student's progress are the most prominent ones. They are adaptable
to any type of curriculum, particularly for the learning of the handicapped”
(Proger & Mann 1973).

4.2 Role of Criterion-referenced Tests in Individualised Instruction

Individualised instruction emphasises three basic things.

(i) Specific norms for the education of individual students.

Learning expectations should not be the same for all the students.
They should match their unique personality patterns and
potentialities. If we expect a child to attain an achievement standard which
is too high for him it would only bring failure and consequently frustration,
even apathy towards education. All these may result in his energy being
channelled into undesirable directions. Similarly, too easy criteria may also
pose problems.

(ii) Elimination, in most cases, of the concept of failure and success.


If the educational norms set for the child are in tune with his needs, 
aspirations and potentialities, chances of failure will be eliminated to a great 
extent. 

(iii) Competition with ‘self’ instead of with others. 

In individualised instruction the concern of a teacher is not so much with
how a particular student stands in order of merit among other students as
with how much he has achieved against the criterion set for him.

It is clear therefore that with such emphases in individualised education,
criterion-referenced tests are much more suitable for placement work and
monitoring the progress of students towards their desired ends.
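
Monitoring a pupil against his own criterion, rather than against other pupils, can be sketched as follows. The pupil names, criteria and scores are hypothetical, and the report fields are only one possible way of summarising progress.

```python
from dataclasses import dataclass


@dataclass
class Pupil:
    name: str
    criterion: int   # number of tasks this pupil is expected to perform
    history: list    # scores on successive tests, oldest first


def progress_report(pupil):
    """Summarise a pupil's latest score against the criterion set for him."""
    latest = pupil.history[-1]
    return {
        "name": pupil.name,
        "criterion_met": latest >= pupil.criterion,
        "gap_to_criterion": max(pupil.criterion - latest, 0),
        "gain_since_first_test": latest - pupil.history[0],
    }


# Hypothetical pupils with different criteria, as individualised instruction allows.
pupils = [
    Pupil("Asha", criterion=45, history=[30, 38, 46]),
    Pupil("Ravi", criterion=10, history=[4, 6, 8]),
]

for pupil in pupils:
    print(progress_report(pupil))
```

Each pupil is compared only with his own criterion and his own earlier scores, so no ranking among pupils is produced at any point.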

4.3. Role of Criterion-referenced Tests in Vocationalisation of Education

In vocational education emphasis is mostly on developing those psycho-motor
skills which help in proficiency in a certain vocation. They are clearly
delineated. Criterion-referenced tests again can be more suited in such a
situation to assess how far a student has attained these skills and how much
more he has to improve his performance in order to achieve the desired level
of efficiency.

Besides these, criterion-referenced tests

i) are built and used to assess carefully prescribed behavioural objectives;

ii) are more sensitive to changes brought about by interventions;

iii) permit operational classification of success or mastery;

iv) can be interpreted directly.

v) The work simulation approach, now becoming popular in educational and
vocational planning and counselling, also seems to be an
appropriate domain for criterion-referenced tests.


5. LIMITATIONS OF CRITERION-REFERENCED TESTS 


5.1 Scope of criterion-referenced testing is limited at best to the secondary 
stage of education because of its typical nature as described earlier. 

Beyond this stage of education the curriculum becomes so varied and
complex that criterion-referenced tests gradually lose their appropriateness.
As Rankin (1971) points out, “Evaluation of a broad curriculum has to refer
to the assessment of the student achievement on a large number of learning
hierarchies. A large set of criterion-referenced tests spanning such a large
number of hierarchies might yield information highly correlated with that from
appropriate norm-referenced standardised test, but then the chances are that
the standardised test is the more efficient evaluation instrument”.
Criterion-referenced tests have their greatest relevancy only in the
assessment of position within a specific learning hierarchy.

5.2. The value of a particular school learning should, more appropriately,
be considered in terms of its transfer value. Here two types of transfer may be
visualised: lateral and vertical. Lateral transfer refers to the processes in
which “the capabilities specially learned in school should enable the
student to perform some acts of practical value to him, whether in his every
day life or in connection with an occupation” (Gagne, 1970 p. 335). Vertical
transfer on the other hand “refers to the effects that learned capabilities at
one level have on the learning of additional ones at higher levels” (Gagne,
1970 p. 335). Learning hierarchies can make explicit the vertical transfer
values of what is learned and this can act as a criterion to be tested by a
criterion-referenced test before the student is allowed to go up in his
instructional experiences. However, in lateral transfer criterion-referenced
testing may not be of much use. For measuring lateral transfer, tests should
be such as to have predictive validities for some extra-school behaviours.
Such a situation therefore precludes the use of criterion-referenced tests
in their own right.

5.3 Again as Glaser (1968 p. 33) points out "it is conceivable that 
individualised instruction will find its major value in attaining not only 
achievement objectives but other educational goals such as self-direction, 
self-initiation of this learning and feeling of control over one's learning 
environment. Success in reaching these outcomes of learning is difficult 
to measure....", because such outcomes are not carefully specified in the
stipulated hierarchies of behavioural objectives. They may be viewed as 
incidental learning although they may be of utmost importance. Criterion- 
referenced tests will not stretch to cover them. 

5.4. Similarly in testing the creative and productive abilities of children
criterion-referenced tests have their own limitations. The horizontal
dimensions and vertical hierarchies of such abilities are not clearly delineated
and hence criteria for these dimensions and hierarchies cannot be easily
fixed. Summative evaluation is the only way of assessing the relative rank of
a particular student in a particular group.


6. FINAL ASSESSMENT 


In spite of all these limitations, the prospects of criterion-referenced tests
are quite bright, particularly in the light of the National Policy of Education
1986, due to the following reasons.

6.1. Different types of measuring tools of evaluation have their specific
relevance only up to the secondary stage. After that the controversy and choice
have little meaning. And up to this stage in most cases criteria can be clearly
and easily specified and evaluated through criterion-referenced tests.
Education up to the secondary stage is mostly foundation education in which
basic skills and abilities are to be developed. They are not so complex as
they gradually become in higher education and hence can be taken care of,
in most cases, by criterion-referenced tests.


6.2 In vocational education also the emphasis is on developing certain
psycho-motor skills which can be evaluated only in terms of
criterion-referenced tests.

6.3. All said and done it does not mean that norm-referenced tests 
should be abolished even upto Secondary stage of education. They have their 
own advantages and are likely to be in co-existence with the criterion- 
referenced tests even upto secondary level and even in certain aspects of 
vocational education. 


New Education And Criterion Referenced
Measurement


N.P. Banerjee


ABSTRACT


Priority organization of values is a variable across time.
Change in the value priority is reflected in the aims of
education and its system. Universalisation and equalisation of
education have been the priority values of the day. Equalisation at
the provision level is not meaningful, particularly in education,
where equalisation must be at the output level to minimize
wastage of human resources. The existing Norm Referenced
Education with its paraphernalia does not have any scope to do
this. Society is in search of an alternate model. Researchers
have come close to the true answer. Criterion Referenced
Education through Mastery Learning approaches is a close
approximation to it. Criterion Referenced Measurement with its
necessary tool, Criterion Referenced Tests, is an essential
subsumed subsystem of Criterion Referenced Education.
CRM necessarily has its own logistics and techniques of tool
construction, evaluation and interpretation. Even if those are not yet
fully known in their details, the logistics and techniques of the
alternate system cannot be imposed on it. Its own way and own
logistics are to be established.


1. INTRODUCTION 

As in all growing areas of knowledge, during the last century researchers
in the field of Psychology and Education began to understand the
processes of the mind and life by exploring different aspects of structures,
functions and operations. Many dogmas and beliefs evolved through
common experiences and reflections of individual persons dominated the
field. As a challenge to it, the early trend had been research to
understand specific operations separately. During the early decades of this
century psychologists directed their efforts to understand operations like
learning, remembering, forgetting, transfer etc. Theories of learning evolved
from laboratory situations. Teacher-psychologists put their efforts into
understanding teaching, exposition, communication, models of teaching,
strategies of teaching etc. in classroom situations.


1.1 Progress of human civilization during the later decades of this century
has brought about many changes in the way of human thinking. The two
important ones are: acceptance of the empirical way of verifying hunches and
guesses as true relations as the ultimate, and the dominance of democratic
rights of individuals in a morphological organization of human society over
the social value of priority and preferential rights in a hierarchical or
group dominance organization of society. These two have brought about
changes in almost all the areas of human thinking. This has equally influenced
the thinking in the fields of psychology and education. Teaching was
considered to be a purposeful activity and the prevalent social thinking
determined its frame and ways of judgement on the basis of selectivity and
hierarchical organization. It was accepted that from any closed population
groups may be selected in which members have the ability to learn certain
things and contents of certain complexity while others beyond the group do
not have the ability. The hierarchical order of learning ability remains stable
across time. This generated systems of selecting groups from the population.


2.0 SOCIAL VALUE AND EDUCATIONAL MEASUREMENT 


Measurement in the fields of psychology and education was oriented
to this social value frame of selectivity and hierarchical organization. This
generated faith in the high positive relation between primary abilities of
children and educational probabilities of different groups of children. This
eased the administration of the problem of equalisation of opportunities. Equal
or same facilities at the provision level were taken to be justified. It
necessitated selection and classification of children for educational
management. The tools and techniques evolved in the area of measurement were
geared to it. Tests were designed to understand the status of an individual in
relation to the group. The consideration being equalisation at the provision
level, the onus of receiving it and drawing benefit from it lay on the pupil.

2.1 The faith of the psychometricians in the wide applicability of the
theory of normal distribution in understanding the individual in relation to
the group, the group in relation to other groups and the population took them
too far. The purpose was lost in the jungle of techniques. They tried to extend
the use of the theory to explain facts and phenomena which were not even
verified to be normally distributed. The necessary outcome had been wastages
and misfits. Tests and measures with the assumption of some predetermined
distribution were claimed to be refined by referring them to the
same situation, i.e. the Norm. The purpose of such measurements was
selection and ranking. These were later known as Norm Referenced
Measures.


3.0 SYSTEM APPROACH TO UNDERSTAND EDUCATION 


During the sixth and seventh decades of this century two important
trends of thinking were identifiable: the whole process of education began to
be considered as a system, each specific operation finding its place in the
matrix as a subsystem; and the democratic value of each individual as a unit
of human resource in the human societal whole came to be accepted. This
calls for equalisation at the level of output.

3.1 This re-orientation of human thinking led to several revolutionary
changes. The selective modality of education is replaced by the differential
mode of operation. Necessarily the reference has shifted from the status
of the individual in relation to a contemporary group to the trajectory of his
own growth. The individual has been accepted as his own reference, at least
during the period of fast growth in his own life. The principle of
selection-rejection has been replaced by that of description, diagnosis and
placement. Even the most disadvantaged is not considered as a social liability
to be doled out social charity, but is considered to be a social asset in his
own right, and the social machinery has to accept the responsibility to make
the best of him.

3.2 The modern way to understand any operation is the systems
approach. The two basic principles of this approach are: the whole minus
a unit no longer remains the whole and loses its significance as a whole;
and a unit does not have any significance or meaning in isolation, the unit
must fit in its place in the whole to have it, and the adequacy of the fit
generates the adequacy of the whole.

3.3 Smaller wholes organise themselves to generate the greater whole;
conversely, a larger system accommodates a number of smaller sub-systems.
Each sub-system can then in turn be looked upon as a system having
further sub-systems subsumed in it.

3.4 Education is such a large system. In it, the smaller system of classroom
teaching is subsumed. Classroom teaching-learning in its turn is
organised by incorporating a number of smaller systems. Evaluation finds its
place in the operational chain of teaching-learning.


4.0 VALUE PRIORITY AND TEACHING-LEARNING INTERACTION


In consonance with the emergent value priority, Jerome Bruner (1970)
and others claimed that, provided sufficient time for interaction is available,
any content to be learnt can be taught to any individual through the
appropriate interaction. They have claimed empirical evidence of the validity
of the idea under normal school situations.


5.0 MASTERY LEARNING APPROACH AND CRM 


An elaboration of this idea led Bloom (1971) to claim that, at the normal
level of learning, almost 90 or 95 p.c. of a normal group can master the
learnable content, where mastery was considered to be learning of 90 p.c. of
the elements to be learnt. This phenomenon was termed mastery learning.
Mastery learning has put forward its claim to be the criterion of the
effectiveness of teaching-learning strategies. Since the expression of this
view in 1971, Bloom and other researchers (Bloom, 1971, 1976; Block, 1971,
1974; Block and Burns, 1976) have carried out research to verify the validity
of this proposition. Models of teaching for mastery learning and relevant
strategies have come out. These strategies require a different way of tackling
the issue of measurement and evaluation, where achievement means something
other than what it meant earlier. CRM, i.e. Criterion Referenced Measurement,
has been identified as an important link, acting as a control and monitoring
agency in this evolving system of teaching-learning. Like all evaluation
procedures, CRM is a small system, which involves the use of CRT, i.e.
Criterion Referenced Tests, as its operational tool. CRT or CRM in isolation
does not have any significance, and the system of teaching-learning on the
lines of the mastery learning approach is incomplete and without significance
without CRM.
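The two thresholds in the claim above (mastery as 90 p.c. of the elements, and the share of the group reaching it) can be illustrated with a minimal sketch. The function names and the sample scores are hypothetical and are not taken from Bloom's studies.

```python
def is_master(elements_learnt, elements_total, mastery_pc=90):
    """True when the pupil has learnt at least mastery_pc per cent of the elements.

    Integer arithmetic is used so that a score of exactly 90 p.c. counts as mastery.
    """
    return 100 * elements_learnt >= mastery_pc * elements_total


def proportion_of_masters(scores, elements_total, mastery_pc=90):
    """Share of the group (0.0 to 1.0) that reaches the mastery level."""
    masters = sum(is_master(s, elements_total, mastery_pc) for s in scores)
    return masters / len(scores)


# Hypothetical group of ten pupils tested on a unit of 20 learning elements.
group_scores = [20, 19, 18, 18, 17, 20, 19, 18, 20, 18]
share = proportion_of_masters(group_scores, elements_total=20)
print(f"{100 * share:.0f} p.c. of the group reached mastery")
```

Here the pupil scoring 17 out of 20 (85 p.c.) is the only nonmaster, so the group as a whole sits at the lower bound of the 90 to 95 p.c. range mentioned in the text.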

5.1 To make the system of TL, i.e. teaching-learning, work through mastery
learning strategies, it is necessary to understand the system in its details
and the place of CRM and CRT in it. Such an analysis helps to identify the
pre-requisites of the process and the attributes of the tools. In the
following section an attempt has been made to place CRM and CRT in their
proper perspective.


6.0 THE TEACHING-LEARNING PROCESS

From recent research on human learning under normal conditions, some concepts
of the field theorists have been found relevant. One of these is the
structuring of the cognitive map of the content of learning bit by bit, with
learning occurring when the map, with all the bits and the structure, is
complete. Before it is complete there is only an accumulation of bits of
learning elements. Even when all the bits are acquired, if no structure has
evolved there is almost no learning. The moment the structure evolves, there
is learning all at once. This is close to the claims of the Gestalt school as
well. Bruner (1971) emphasized the importance of the structure. Gagné (1968)
has given a scheme of learning which can be extended to accommodate later
concepts.


Sequence of learning phases (schematic)

1. Stimulus situation
2. Apprehending phase: attending, perceiving, coding
3. Acquisition phase: acquiring
4. Learning of elements and of skills
5. Storage phase: retention, memory storage
6. Retrieval phase: recognition, recall, transfer of elements
7. Structuring phase: structuring of the map; stabilizing the structure in
   the system of knowledge
8. Performance phase: observable output; transfer, application, problem
   solving
9. Integration into personality


It can be observed that learning at the mastery level up to the sixth phase
is essential for the reduction of human wastage and provision of equal facility
[...] is the new strategy for the task.


6.1 The whole chain of the phase sequence may be restructured and
represented as a two-system scheme for two levels of learning.


System I
    Input:   Content elements (level, structure, organisation)
             Pupil abilities
             Pupil motivation
             Pupil effort x time
             Pupil personality
             Environmental conditions
             Teaching cues
    Process: TLI
    Output:  Pupil acquirement of content elements (learning of discrete
             elements with a low level of transfer competency)

System II
    Input:   Learnt content elements
             Pupil's previous experiences
             Pupil's structuring faculty
             Pupil motivation
             Time
             Pupil personality
             Environmental conditions
             Teaching cues
    Process: Incubation
    Output:  Learning of the structure (with a high level of transfer
             competency)


Several points are of much importance, viz.

a. Even with some gaps in the acquirement of learning elements, the gaps
   are closed by substitutes and a structure evolves. Such a structure is
   less likely to be content valid and less likely to fit into the
   knowledge whole, leading to wrong learning.

b. Elements must be available beyond threshold learning before incubation
   takes place, with or without cues.

c. Pre-existing adjacent structures help in the structuring of the map.

d. Content characteristics involve type of material, level of structure
   complexity and form of communication organization.

e. Pupil characteristics, along with other stable ones, include learning
   styles. Learning styles interact differentially with different
   communicative forms.

6.2 From the above considerations two forms of learning situations
can be deduced:


a. Differential learning of content elements: different individuals
   learn different amounts and have different amounts and types of gaps.
   This ends in some being able to incubate the valid structure, some
   others a distorted structure, and a third group no structure at all.

b. Uniform learning with appropriate organization of the input variables
   and adequate monitoring of the process variables. This ends in pupils
   incubating valid structures with or without cues.

The first situation is the traditional classroom TLI with fixed time x
exposition. The second is that of mastery learning, with monitored TLI and a
fixed achievement target. The target of achievement is the Criterion of
Effectiveness of the TLI.


7.0 LEARNING ORDER AND SEQUENCE

From similar research, Gagné (1968) has attempted to establish a learning
hierarchy on the basis of operation complexity, transfer competency and
operation sequence. He is of the view that lower learning is a pre-requisite
for the next higher order of learning. His chain begins with signal learning.

Bloom (1974), in evolving the taxonomy of educational objectives in the
cognitive domain, has developed a similar chain: Knowledge → Comprehension →
Application → Analysis → Synthesis → Evaluation.

Beyond this, there are other objectives of learning for which human
functions in all three domains, and integrated functions, are necessary.

7.1 At the formal school level, TLI in the classroom is primarily oriented
to the learning of knowledge, development of concepts, learning of skills,
application of what is learned, and solving problems.


8.0 SOURCE AND MODES OF TLI

From an analysis of various forms of learning from various sources,
classroom learning can be understood from the following scheme.

Learning of knowledge, concepts, skills and applications is taken as
learning at the ordinary level, and learning of analysis, synthesis,
evaluation, appreciation etc. is taken to be learning at a higher level.

In the classroom the teacher takes to group teaching. The two situations
of TLI have been discussed earlier (6.2). In each of the situations the three
primary ways are given here.

Source and Mode of Learning and Teaching

Teaching-Learning (ordinary level learning)
    Mass teaching: public address systems; mass media communications
    Systematic teaching: group teaching; individualized teaching
    Incidental learning

Lower branches of the chart: visual, A.V., audio; expository teaching,
discovery learning, self learning, guided learning, developmental teaching.




9.0 MASTERY LEARNING OPERATION SEQUENCE 

Research by Skinner (1954) and others evolved the modern strategies of
individualization of TLI. Some concepts they evolved contribute to the
development of group teaching as well.

Bloom expressed (1968) that most students (perhaps more than 90 p.c.)
can master what we have to teach them. Bloom also claims (1971) that when
pupils learn at the mastery level the rate of learning gradually increases.
Slow learners gradually catch up with the fast learners, i.e. individual
differences in respect of rate of learning decrease.


9.1 Although this has generated a controversy between the proponents and
the critics of mastery learning, it is the only way visualised so far towards
equalization of education, and so the strategy of mastery learning has to be
refined, adapted and used. "Critics argue that under typical schooling in
which time is held constant, individual differences between students are
reflected in differences in achievement outcomes. If achievement outcomes are
held constant as in mastery learning then individual differences between
students will be reflected by difference in time to learn, or learning rate"
(Marshall Arlin and Janet Webster, 1983). After a short term experiment
(reported in 1983) they are of the opinion that the hypothesis of reduction
of individual differences in rate of learning is yet to be verified through
empirical research. But nobody denies that achievement can be equalised.
There is no indication in the report that all the learning variables, such as
matched-unmatched conditions, were considered.

9.2 In some studies in India (Pan 1985) the researcher finds that, with time
of learning constant and under matched conditions, individual differences in
learning outcomes are reduced.


1st Cycle: Entry level preparation
    Entry level evaluation -> ready learners / unready learners
    Unready learners -> remedial teaching

2nd Cycle: TLI for unit teaching
    Orientation to learning goal (determination of well described
    learning domain)
    Teaching-learning interactions
    Evaluation (progressive, descriptive)

3rd Cycle
    Masters / nonmasters
    Gap identification -> remedial teaching
    To next unit


The above flow chart shows that, among other facilities, the system requires
adequate evaluation tools in all the cycles. The essential pre-requisite of
this system is a technique to dichotomize the treatment groups at each
cycle, into ready and unready learners at the first cycle and into masters
and nonmasters in the later cycles. The second pre-requisite is a technique
to identify the learning gaps. Criterion referenced tests are appropriate
tools to meet these two needs.
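The two needs named above, dichotomizing the group and locating the learning gaps, can be sketched for a single pupil as follows. This is only an illustration: the element labels are invented, and the 90 p.c. cut-off is assumed from the mastery definition in section 5.0.

```python
def classify_and_find_gaps(responses, criterion_pc=90):
    """Dichotomize one pupil and list the learning gaps.

    responses maps each content element to True (answered correctly) or False.
    Returns ("master" or "nonmaster", list of elements still to be learnt).
    """
    if not responses:
        raise ValueError("at least one content element is required")
    correct = sum(1 for ok in responses.values() if ok)
    reached = 100 * correct >= criterion_pc * len(responses)
    gaps = [element for element, ok in responses.items() if not ok]
    return ("master" if reached else "nonmaster"), gaps


# Hypothetical unit test with ten content elements, two of them failed.
unit_test = {f"element_{i}": True for i in range(1, 11)}
unit_test["element_4"] = False
unit_test["element_9"] = False

status, gaps = classify_and_find_gaps(unit_test)
print(status, gaps)
```

The same routine could serve the first cycle by reading "master" as "ready" and "nonmaster" as "unready"; the list of gaps is what directs the remedial teaching.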

10.0 CRM AND ITS LOGIC

The most important aim of education for the twenty-first century can be
identified as empowering the child. In the field of learning, gradual
reduction in the time/achievement ratio, or gradual increase in the rate of
learning, indicates improvement in the learning power of the child. The TLI
system presented earlier (6.1) shows that only from mastery learning in
system I can the child be empowered to learn in system II.


10.1 The alternate system of nonmastery learning has through the years
evolved its own operational sub-systems. These sub-systems have evolved in
turn their own logic and relevant tools and techniques. Although it is
difficult to break through the closure of time-honoured logistics, it is an
essential need. For example, in the development of NRM and NRT, item
selection looked for an index of general discrimination power, which does not
have any relevance in item selection for CRT construction. Here the only
discrimination required is at the mastery-nonmastery point. An NRT is a
sample of stimulus situations from the universe of stimulus situations in the
content area. But in CRM the items of a CRT are not samples; each is a
stimulus situation to test a content element. It does not have any
alternative. The only scope for selection is when two or more items are
constructed from the same content x task stimulus. The need here is
improvement of the item.
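One way to make "discrimination at the mastery-nonmastery point" concrete is a simple difference in proportions: the share of masters answering an item correctly minus the share of nonmasters doing so. The paper does not prescribe an index, so the sketch below, with its hypothetical function name and data, should be read as one possible operationalization only.

```python
def mastery_point_discrimination(item_correct, is_master):
    """Difference between masters' and nonmasters' success rates on one item.

    item_correct: list of 0/1 responses to the item, one per pupil.
    is_master:    list of booleans giving each pupil's mastery status.
    A value near 1 means the item separates the two groups sharply;
    a value near 0 means it does not separate them at all.
    """
    masters = [c for c, m in zip(item_correct, is_master) if m]
    nonmasters = [c for c, m in zip(item_correct, is_master) if not m]
    if not masters or not nonmasters:
        raise ValueError("both masters and nonmasters are needed")
    return sum(masters) / len(masters) - sum(nonmasters) / len(nonmasters)


# Hypothetical responses of six pupils, three masters and three nonmasters.
responses = [1, 1, 1, 0, 0, 1]
status = [True, True, True, False, False, False]
print(round(mastery_point_discrimination(responses, status), 2))
```

Because a CRT item cannot simply be swapped for another, a low value here would be a signal to examine and improve the item rather than to discard it.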

10.2 In CRT each item is content specific in the domain of testing. To
improve the item, the process flow of a test operation needs to be clear. In
a test situation three operation systems are found, viz.

Communication reception -> Problem solving operation -> Expression

Important variables at each step may be identified as:

Reception:  decoding, perception, coding for operation, motivation
Operation:  content retrieval, operation, structuring, motivation
Expression: coding the product, expression, transmission, motivation,
            use of external facilities

The output of the chain is the test indicator.


Although operation at the central frame is the test target, lower ability at
any of the terminal frames may lead to a wrong indicator, and the operation
may be judged a failure. Abilities for operation at the terminal frames have
been termed necessary abilities. In NRM, an item being easily replaceable,
test makers are not keen on the analysis of item variables. In CRM item
judgement it is a must. Instead of being psycho-statistical, it is
psycholinguistic in nature. By looking into the alternatives the item can be
restructured.

10.3 In evaluating tests, indices like reliability and validity are
considered. The frames of consideration are different in NRM and CRM. In the
case of NRM a test-retest, a split-half or some other index is used to
understand the measurement stability and test homogeneity. These do not have
any similar reference in CRM. A different index with a different attribute
reference is necessary for the evaluation of CRTs. Similar is the case with
validity.
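The paper does not name the alternative index. One candidate discussed in the wider CRM literature is the consistency of mastery decisions across two administrations, reported as raw agreement and as Cohen's kappa (agreement corrected for chance). The sketch below swaps in that technique purely as an illustration; the function name and data are hypothetical.

```python
def decision_consistency(first, second):
    """Agreement of master/nonmaster decisions over two administrations.

    first, second: lists of booleans (True = classified as master).
    Returns (p_agree, kappa), where kappa is Cohen's chance-corrected agreement.
    """
    if len(first) != len(second) or not first:
        raise ValueError("two non-empty lists of equal length are required")
    n = len(first)
    p_agree = sum(a == b for a, b in zip(first, second)) / n
    p1 = sum(first) / n    # share classified as masters on the first occasion
    p2 = sum(second) / n   # share classified as masters on the second occasion
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    if p_chance == 1:
        raise ValueError("kappa is undefined when every decision is identical")
    return p_agree, (p_agree - p_chance) / (1 - p_chance)


# Hypothetical decisions for eight pupils on two occasions.
occasion_1 = [True, True, False, False, True, True, False, True]
occasion_2 = [True, True, False, True, True, True, False, False]
print(decision_consistency(occasion_1, occasion_2))
```

The attribute being referenced is the stability of the classification, not the stability of the score spread, which is why test-retest correlation has no direct counterpart here.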

10.4 Unlike NRM, with its omnibus use, CRM cannot be used for multiple
purposes. It is clearly purpose-specific. Imposition of traditional logic on
CRM does not have any justification. The logic of CRM has to evolve through
research and evaluation.


Criterion Referenced Assessment in
Special Education


Sushil Kumar Goel 


ABSTRACT 
The use of criterion-referenced assessment in special education has a
distinct advantage given its utility for defining what to teach and what
aspects to emphasise with different groups performing at different levels.
The available criterion-referenced tests and their suitability for Indian
schools are discussed in detail.


1. ASSESSMENT OF CHILD'S PROGRESS 

Early childhood programmes can be evaluated in many areas (e.g., child's
progress, parent involvement, cost effectiveness, curriculum impact,
programme impact). These areas invariably overlap. Evaluation is a
systematic process by which judgements are made about the relative
desirability, adequacy, effectiveness, or worth of something, often
according to a definite criterion or standard, for a specified purpose
(Goodwin, 1974). The effectiveness of a programme, particularly one that
includes young handicapped children, is invariably measured by the skill
gains effected by the intervention.

The many methods of assessing the educational performance of 
exceptional children can be divided into two categories, informal and formal. 

Informal Assessment relies on (a) teachers’ observations of children's 
varying skills in different areas, which may be recorded in what are called 
anecdotal records; and (b) teacher-constructed tests designed to determine 
whether a child has learned what is being taught. 

Formal Assessment relies on tests developed by test publishers. These
may include achievement tests to measure academic attainment, intelligence
tests to estimate level of ability, and parent interviews to obtain
information about social skills. Tests of language, personality, creativity,
physical ability, vocational interest, and other areas may also be warranted.

Most of the tests available from test publishers are standardized.
This means the test has been given to a large number of people under
identical conditions: all people received the same instructions and had the
same amount of time to complete the test. Also, all the tests have been
scored the same way. According to test publishers, the test must be
administered and scored precisely as the directions indicate for the results
to be useful.


1.1 Norm- and Criterion-referenced Tests


Assessment of child's progress has followed two basic monitoring systems,
each with variations as required by characteristics of the population (e.g.,
test adaptations for blind children), theoretical perspective of staff (e.g.,
behavioural, enrichment), and curriculum package (e.g., Portage Guide to
Early Education Checklist, Developmental Activities). A norm-referenced
assessment system is traditionally included in nearly all programmes for
young handicapped children (Bricker, Sheehan, & Littman, 1981). While
widespread use of norm-referenced assessment measures in special
education is a carry-over from regular education, criterion-referenced
assessment has been used far more extensively in special education.

It is helpful to make the general distinction between criterion-referenced 
and norm-referenced tests. This distinction was originally drawn by Glaser 
(1963) and has been discussed by Ward (1970), Moxley (1974) and many 
others. 


1.11 Norm-referenced Assessment

Norm-referenced assessment is designed to identify individual
differences. Norm-referenced tests are those which compare a particular
student's performance to that of the norm group, the group of people on which
the test was standardized. For example, a score at the 2.0 level in reading
on an achievement test indicates that the child reads about as well as most
children at the beginning of the second grade. Most of the achievement tests
you took in school are norm-referenced tests.

The items in norm-referenced tests are not likely to be those included in
an instructional programme, so the results have little relevance for the
educator. Measurement takes place twice a year for the purpose of determining
pre-post differences, and less frequently if these data are not required. A
psychologist or educational diagnostician is more likely than the teacher to
administer these tests, thus making this assessment even less relevant for
curriculum and intervention.

Early childhood educators have used norm-referenced testing to measure
child's progress. Achievement tests have been widely used in measuring gains
in school-aged populations; however, there are no equivalent measures for
infants, toddlers, or preschool children. Several forms of testing have
emerged to fill this gap. Standardized mental ability tests for infants and
preschool children have become common pre-post test measures of child gains
resulting from intervention and have been used for programme evaluation
purposes. The Bayley Scales of Infant Development (Bayley, 1977) and the
McCarthy Scales of Children's Abilities (McCarthy, 1972) are commonly used
instruments (Bricker, Sheehan & Littman, 1981; Sheehan and Gallagher,
1983). In addition to mental age equivalency comparisons, researchers
(Rosen-Morris & Sitkei, 1981) are using raw data pre-post test comparisons
to show child's progress.


1.12 Criterion-Referenced Assessment 


When a child's performance on a domain-referenced instrument is
compared to a preset criterion or standard for mastery, the instrument is
referred to as a criterion-referenced test (Martuza, 1977). A
criterion-referenced test does not compare one child's performance with that
of other children. Instead, the child's performance is compared to some
standard, called a criterion. For example, a teacher giving a multiplication
test on the nines tables is interested only in whether a child can correctly
multiply the numbers 0 through 10 by 9. The criterion would be 100 percent
correct responses. Criterion-referenced tests are used to determine whether a
child can perform a particular task, and not how well her performance
compares to other children's. Criterion-referenced tests are especially
useful in determining a student's readiness for the next level or sequence of
instruction, or in prescribing a remedial loop for instructional objectives
when the test data do not indicate mastery. The process of determining
mastery and readiness for specific areas of instruction is also facilitated
by the use of criterion-referenced testing procedures.
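The contrast between the two interpretations can be sketched with the nines-table example. The function names and the classmates' scores are hypothetical; the 100 percent criterion and the eleven items (0 through 10 multiplied by 9) come from the text.

```python
def meets_criterion(correct, total, criterion_pc=100):
    """Criterion-referenced reading: has the child reached the preset standard?"""
    return 100 * correct >= criterion_pc * total


def percent_scoring_below(score, group_scores):
    """Norm-referenced reading: percentage of the comparison group scoring lower."""
    below = sum(s < score for s in group_scores)
    return 100 * below / len(group_scores)


# The nines-table test: 0 x 9 through 10 x 9, i.e. eleven items.
items = [(n, 9, n * 9) for n in range(11)]
child_score = 10                      # one item missed
classmates = [4, 6, 7, 8, 9]          # hypothetical comparison group

print(meets_criterion(child_score, len(items)))        # mastery decision
print(percent_scoring_below(child_score, classmates))  # relative standing
```

The same score of 10 out of 11 places the child above every classmate under the norm-referenced reading, yet falls short of the criterion, which is what tells the teacher that one multiplication fact still needs teaching.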


Not only is criterion-referenced assessment more closely tied to
curriculum than norm-referenced assessment, it may even be synonymous
with curriculum if a prescribed instructional programme accompanies the
criterion test. The Uniform Performance Assessment System (White, Edgar,
Haring, Affleck, Hayden & Bendersky, 1980), the Student Progress
Record (Oregon State Mental Health Division, 1977), and the Pennsylvania
Training Model (Somerton-Fair & Turner, 1979) are comprehensive
assessment systems that list literally hundreds of tasks. Although many
teachers who give these tests take the failed items as objectives and design
instructional programmes to meet the objectives, the test developers have not
provided such plans. On the other hand, curriculum-referenced tests provide
instructional plans for teaching each task noted in the measure. The West
Virginia Assessment and Tracking System (Cone, 1981) and the Hawaii Early
Learning Profile (Furuno, O'Reilly, Hosaka, Inatsuka, Allman, & Zeisloft,
1979) are examples of curriculum-referenced measures.

Criterion-referenced instruments, whether the curriculum is included or
not, provide information for the instructional programme. The skills of each
child are assessed, and progress is measured by a comparison only with the
child's prior performance. Progress on criterion measures is usually charted
daily for immediate feedback. Thus these measures provide formative
evaluation data to programme managers. Gain scores are expressed in terms
of the percentage of skills gained or the raw number of skills achieved on
pre-tests and post-tests. The Teaching Research Handicapped Children's
Programme was approved by the Joint Dissemination and Review Panel to present
child's progress data that were based on the accomplishment of objectives
selected from a set curriculum. The panel is responsible for the evaluation
of educational programmes. Although this type of measurement system had
been widely practised in programmes for the handicapped, it had never been
approved before as a viable means for documenting effectiveness. This
action represented a landmark decision in the evaluation history of
programmes for handicapped children and provided a model for the
measurement of child's progress and programme effectiveness. It is
anticipated that other programmes serving severely impaired children will
pursue this procedure to document accountability.
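The two ways of expressing gain mentioned above can be sketched as follows. The skill names are invented, and taking the percentage against the total number of skills in the curriculum is an assumption; the text does not state the base.

```python
def gain_scores(pre_skills, post_skills, total_skills):
    """Raw and percentage gain between a pre-test and a post-test.

    pre_skills, post_skills: sets of skills demonstrated on each occasion.
    total_skills: number of skills in the curriculum (assumed base for the percentage).
    """
    gained = post_skills - pre_skills          # skills newly demonstrated
    raw_gain = len(gained)
    percent_gain = 100 * raw_gain / total_skills
    return raw_gain, percent_gain, sorted(gained)


# Hypothetical child assessed against a ten-skill curriculum.
pre = {"grasps spoon", "stacks two blocks"}
post = {"grasps spoon", "stacks two blocks", "points to nose",
        "names one colour", "drinks from cup"}

print(gain_scores(pre, post, total_skills=10))
```

Because the comparison is only with the child's own earlier performance, the list of newly gained skills is as informative to the teacher as the summary figures.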


1.13 Salient Distinctive Features 


Norm-referenced tests depend on relative rankings between individuals,
while criterion-referenced tests are based upon mastery of a specified
performance on a particular task. The selection of a norm-referenced
measure depends on the existence and maintenance of variability between
individuals, while that of a criterion-referenced measure essentially looks
for variability in the environment and emphasizes similarities rather than
differences between individuals. It is evident then that criterion-referenced
tests will be sensitive to the differences produced by an instructional
technique. Such tests can be designed so that they are difficult when
administered before the training and easy afterwards. If the objectives are
defined in terms of the instructional material actually employed, then such a
system guarantees successful attainment as long as the individual child
is eventually capable of learning with that system. Formative evaluation is
used for diagnostic purposes and criterion-referenced tests are used for
summative evaluation to determine mastery. Mastery learning is an approach
to education which emphasizes the individualization of instruction with the
aim of enabling the vast majority of students to achieve mastery of the
educational objectives.

The difference between norm-referenced and criterion-referenced
interpretation of test results is seen in the following example. On the
Revised Stanford-Binet Intelligence Test, a child is asked to build a bridge
using three cubes. This information is pooled with information from other
test activities, such as recalling a series of digits, telling the meaning of
orally presented words, and assembling objects. The child's performance on
all of these tasks is then reduced to a total score, which is converted to a
standard score using norm-referenced tables. This standard score, the IQ,
estimates the child's status in relation to his/her age mates on the
construct "intelligence". This is a norm-referenced interpretation of a
test's results. This process does not consider the child's proficiency in
bridge building itself or in performing any other specific items. Yet, if one
accepted that the Binet items represent the skills interpreted as
intelligence, and if one wanted to teach "intelligence", then knowing and
analyzing the student's performance on each test item (the
criterion-referenced interpretation) could help a teacher plan an
intervention programme to teach "intelligence".


In this section, we have seen that the assessment of most exceptional
children requires the use of formal and informal techniques, standardized
and teacher-made tests, and norm- and criterion-referenced tests. Children
are assessed for two primary purposes: (a) for identification, to determine
who needs special education services; and (b) for teaching, to determine what
and how a child should be taught.


2. CONSTRUCTION OF CRITERION-REFERENCED TESTS 


Many a time the classroom teacher will need to develop a test for use
with a particular student or in a specific performance area. The development
of a criterion-referenced test by a teacher depends on that teacher's ability
to follow these steps: (a) Clearly define the content or performance domain
to be tested. (b) Sequence the component parts of the performance from its
initial response or component to its final response or component part, or to
a criterion for an acceptable final or terminal response.

In some cases where the content domain has numerous component
parts (e.g. gross motor skills), the teacher may select a representative
sample of performance items for assessment. The teacher must be careful
that the items selected represent all aspects of the domain; if they do not,
assessment will be inaccurate. For the most part, this type of teacher-made
criterion-referenced test will suffice for instructional programming and/or
sequencing in content domains with relatively small or readily defined
parameters, as in the following: specific letter or shape recognition;
discrete physical motor performances; or specific aspects of social
interaction.


3. DEVELOPING THE INSTRUCTIONAL PROGRAMME 


Criterion-referenced testing in the schools has become widespread in recent
years. It is a popular method because the test results are directly
applicable to planning instructional programmes individually for handicapped
children. The process of translating assessment data into instructional
programmes is most readily accomplished when the assessment data is obtained
through direct observation or criterion-referenced procedures. The use of
these procedures directly assists the process of instructional programme
development for essentially three reasons.


(a) The data is collected within the context of the handicapped student's
learning environment.

(b) The data is collected most often by the teacher who will be developing 
the instructional programme and using it with the handicapped student. 

(c) The data collected reflects the observable behaviours or skills the 
handicapped student has acquired. 

Heise (1977), in discussing instructional programming for the learning
disabled (LD), indicates that "the more nearly the assessment of learning
approximates the context in which the learning problem was found, the
greater the degree of success in matching instructional planning to
instructional assessment."


Current educational practice emphasizes the identification of skills in
which students are deficient. This is done with skill-based or
criterion-referenced tests, so called because they are used to identify the
child's skill deficits and because a student's performance is compared to a
predetermined criterion. The test measures the academic subskills normally
taught in a certain subject (reading, arithmetic, etc.) at a particular grade
level, and determines whether each student has met a certain criterion level
(usually 80 percent or better) indicating mastery of that skill. Those who
have not met the criterion are given additional instruction and then
retested. Effective teachers generally assess children regularly, not just at
fixed times of the year such as at the beginning and end of the grading
period. It is best to adopt the "test-teach-test" principle, according to
which a concept is assessed, taught and then reassessed. If the concept has
been learned, the teacher can move on. If not, the concept is taught again,
perhaps using different methods, after the teacher has determined why the
child did not learn during the first instructional sequence.

Once a student's skill deficits have been identified, the teacher can plan
lessons to help the student master these skills. The first step in lesson
planning is to determine an instructional objective that describes exactly
what the student needs to learn. This objective should be based directly on
the items failed on the test. For example, if the child had problems with
two-digit addition problems involving carrying, the objective would be for
the child to compute correctly two-digit addition problems with carrying to
the tens column. If the student had trouble reading words that began with the
initial "Cl" blend, the objective would be for the child to read correctly
words that began with the "Cl" blend. This information could be translated
directly and easily into an individualized instruction programme in the
"reading" of sound blends. To obtain such an inventory, resource teachers
can construct their own tests in which all possible blends are represented
by different test items. This kind of specificity helps the teacher translate
test results into a remedial programme. Resource teachers need highly
specific information about reading, spelling, arithmetic, and language if
they are to plan an effective intervention programme. To obtain this
information, they need the criterion-referenced interpretation of tests that
have detailed and inclusive content.

Most classroom and resource teachers construct, administer, and score
many criterion-referenced tests during the school year. For example, a teacher
assigns ten spelling words on Monday and then gives a spelling test on
Saturday, requiring a child to spell at least eight words correctly in order
to pass. This teacher has constructed a criterion-referenced test. By using
this spelling test to identify words that the child spelled incorrectly, the
teacher can prepare a programme of study. The teacher should, however,
remember that the criteria in many classrooms are inappropriate for
individual students. For example, five words spelled correctly might be a
more reasonable expectation than eight.
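The spelling example can be sketched directly. The word list and the child's answers are invented; the pass marks of eight and five out of ten are the ones given in the text.

```python
def score_spelling_test(assigned, written, pass_mark=8):
    """Score a spelling test against a criterion and list the words to restudy.

    assigned: the words set by the teacher.
    written:  the child's attempts, in the same order.
    Returns (passed, number correct, misspelled words).
    """
    misspelled = [word for word, attempt in zip(assigned, written)
                  if word != attempt]
    n_correct = len(assigned) - len(misspelled)
    return n_correct >= pass_mark, n_correct, misspelled


# Hypothetical list of ten words and one child's attempts.
assigned = ["because", "friend", "school", "people", "thought",
            "through", "enough", "answer", "minute", "island"]
written = ["becase", "friend", "school", "peple", "thought",
           "thru", "enough", "anser", "minute", "island"]

print(score_spelling_test(assigned, written))               # class criterion: 8
print(score_spelling_test(assigned, written, pass_mark=5))  # individual criterion: 5
```

With six words correct the child fails the class-wide criterion but meets the individualized one, while the list of misspelled words is the same in both cases and forms the programme of study.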


Criterion-testing interprets a child's performance on each test item rather
than on the test as a whole. In criterion-testing, the examiner is not
particularly concerned with a child's standing relative to his classroom
peers or to a nationally representative sample of age mates; the examiner
is acutely interested in the child's performance on the individual items.
Good test items include all or most of the elements of the skill being
assessed. By analysing the child's errors on such a test, the examiner can
discern the specific subskills on which the child needs to work and can also
see additional areas in which the child should be given other types of tests.

Teachers frequently design their own skill-based, criterion-referenced tests
by determining what subskills are used in each academic area at a certain
grade level and then developing test items to measure the student's
achievement of these skills. There are also a number of commercially
developed tests on the market. The major skill-based testing and teaching
programmes in the area of reading instruction have been reviewed by Rude
(1974). Many of these programmes describe remedial activities for each of the
subskills tested.


The skill-referenced education approach is appropriate for disabled
readers because it presents tasks designed to help resolve specific
learning problems. Hartman & Hartman (1973) suggest that remedial
programmes that stress lower level skills, such as eye-hand coordination, are
less effective because the skills learned as a result of training may not be
transferred to academic tasks. Another advantage of the skill-referenced
approach is that it can be used by teachers who have little time available for
work with individual students. During a twenty-minute session, a teacher
could give a remedial lesson to a student who, for example, confuses words
having similar visual patterns. The teacher could offer direct instruction,
such as practice differentiating similar words, or indirect instruction, such
as exercises in differentiating forms such as circles and squares. The teacher
could pretest the child, and present low-level instruction only if necessary.
In any case, low-level instruction would have to be followed with direct
instruction using words.

Some research indicates that the effectiveness of the skill-based approach 
depends on the way it is used by the teacher (Morsink & Otto, 1977). 
Face-to-face instruction is more effective than the use of worksheets that 
students complete independently. Teachers who select instructional activities 
carefully, provide enough practice to ensure skill mastery, and show students 
how to apply the skills they learn find that their students retain information 
and generalize better. 

Another problem with the purely skill-based approach is that it is too simple. 
It suggests to some critics that all the teacher needs to do is identify a 
child's skill deficits and present skill-based instruction. If this were true, it 
would imply that learning disabilities are no more than unlearned skills, which 
would suggest that all children who have failed to master the basic skills 
should be identified as LD. This has happened in some places, resulting in 
almost half of a school's population being referred for diagnosis of learning 
disabilities. If the insistence on clearly determined neurological impairment 
resulted in the identification of too few LD children, the assumption that all skill 
deficits are learning disabilities has resulted in the identification of too many. 
Identifying skill needs and providing appropriate instruction is an important part 
of the treatment of LD children, but these techniques should be based on 
careful assessment procedures and used in conjunction with other methods 
when necessary. 

Resource teachers can choose from many suitable criterion-referenced 
checklists and inventories. There are hundreds of skill checklists in reading 
and behaviour; fewer in arithmetic and language; and even fewer in 
spelling and handwriting. Examples of these checklists can be obtained 
from standard curriculum guides or from Smith (1968) and Hammill & 
Bartel (1978). However, it is usually better for resource teachers to make up 
their own tests. Teacher-made tests can be keyed to particular 
instructional programmes, specially constructed to meet the needs of an 
individual student, and designed to serve areas for which no ready-made tests 
or lists are available. 


4. RATING PERFORMANCE PROCEDURES 

The list of component parts for a performance, skill, or response area must 
be combined with some form of rating system on which to record mastery of 
each subskill. If a thorough analysis of the response has been made and if 
the component parts are accurately sequenced from initial to terminal 
response, the classroom teacher can use a dichotomous rating system of 
occurrence or non-occurrence (i.e. yes/no or observed/not observed). This 
type of rating system helps to eliminate the human error and subjectivity 
involved in rating the occurrence of a response along a scale (i.e. excellent, 
very good, good, average, poor, etc.). The teacher should use a scale or 
continuum procedure to rate a response whenever the domain or performance 
area has a qualitative aspect in its criterion for acceptable terminal behaviour 
(e.g. articulation of consonant and vowel sounds, legibility of handwriting). 
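
A dichotomous record of this kind is easy to keep in a simple data structure. The sketch below is illustrative only: the handwriting subskills, the observations and the function names are hypothetical. It assumes the subskills are already sequenced from initial to terminal response, so the first subskill not yet observed marks the point at which instruction would begin.

```python
# Hypothetical task analysis, ordered from initial to terminal response.
SUBSKILLS = [
    "holds pencil",
    "traces letter",
    "copies letter from model",
    "writes letter from memory",
]


def first_unmastered(subskills, observed):
    """Return the earliest subskill in the sequence not yet observed, or None."""
    for skill in subskills:
        if not observed.get(skill, False):
            return skill
    return None


def mastery_summary(subskills, observed):
    """Return (number of subskills observed, total number of subskills)."""
    n_observed = sum(1 for skill in subskills if observed.get(skill, False))
    return n_observed, len(subskills)


# Observed / not observed ratings recorded for one child.
observed = {
    "holds pencil": True,
    "traces letter": True,
    "copies letter from model": False,
    "writes letter from memory": False,
}

print(mastery_summary(SUBSKILLS, observed))
print(first_unmastered(SUBSKILLS, observed))
```

A performance with a qualitative criterion, such as legibility, would need a scale value per subskill instead of a True/False entry.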

Once the performance domain's parameters have been identified, the 
teacher then must specify the major component parts that constitute the 
entire performance domain. Many teachers have found, by carrying out the 
response or performance themselves and carefully noting each identifiable 
step leading towards the criterion response, that they can then construct 
a useful instrument for direct observational assessment of performance. 
Another practice commonly used by classroom teachers is to observe a 
student who possesses the response or performance skill and note the 
identifiable steps leading towards the response. 


5. USE OF COMMERCIALLY AVAILABLE CRITERION-REFERENCED 
TESTS 

The teacher of the handicapped often finds it necessary to develop 
new assessment procedures rather than use commercially available tests. 
This is due to the high degree of variability found in the instructional 
sequencing of a classroom or resource room. For example, in a class of ten 
students, it would be usual to have ten different levels of development exhibited 
across all instructional areas. Finding an appropriate assessment device to 
handle these variations would be a difficult task. However, some commercially 
available criterion-referenced tests can be used by the classroom teacher 
to screen a handicapped child's development and then conduct more 
specific and relevant assessment as indicated by the screening results. 

The following list of commercially available instruments does not in any 
way constitute a complete listing of all the tests, but this compilation of 
assessment devices has been found particularly useful for initial 
developmental screening of children exhibiting severe to moderate 
developmental retardation. These devices need very little training for use by 
the classroom teacher. 

All the assessment instruments given in the list have been developed 
for use in identifying the present adaptive behaviour and the academic, 
pre-academic and communication functioning of severely and moderately 
developmentally retarded individuals. In some cases, the instruments were 
developed and standardized for use with specific groups of individuals (e.g. 
school-age children or ambulatory but severely/profoundly retarded). In other 
cases, the devices constitute a developmental checklist to pinpoint 
functioning across a variety of performance domains. The teacher who knows 
how to use a number of these assessment devices with one student can 
gather instructionally valuable information as well as a rather broad-based 
estimate of present developmental functioning. 

LIST OF COMMERCIALLY AVAILABLE ASSESSMENT INSTRUMENTS 

(A) Name: Balthazar Scales of Adaptive Behaviour (1973) 
Author: Earl E. Balthazar 
Administered by: Classroom teachers, supervisors and ward personnel 
Method of Assessment: Direct observation 
Population for whom intended: Ambulatory severely/profoundly retarded 
Standardized: Yes 
Components: 
Section I - Scales of Functional Independence 
Section II - 8 additional social scale categories describing coping 
behaviours. 


(B) Name: AAMD Adaptive Behaviour Scale, 1974 Revision, Public 
School Version 
Authors: Nadine Lambert, Myra Windmiller, Linda Cole, Richard Figueroa 
Administered by: Psychologist, trained teacher 
Method of Assessment: Observation, parental interview 
Population for whom intended: School-aged children 
Standardized: Yes 
Components: 
Part I - Measures level of development in independent functioning, 
physical development, economic activity, language development, number 
and time concepts, domestic activity, vocational activity, self-direction, 
responsibility, and socialization. 
Part II - Measures the presence or absence of maladaptive behaviours. 


(C) Name: AAMD Adaptive Behaviour Scale, 1975 Revision 
Authors: Kazuo Nihira, Ray Foster, Max Shellhaas, Henry Leland 
Administered by: Psychologist, trained teacher 
Method of Assessment: Observation, parental interview 
Population for whom intended: Mentally retarded children and adults 
Standardized: Yes 
Components: Part I & Part II (same as in "B"). 

(D) Name: The Callier Azusa Scale 
Author: Robert D. Stillman 
Administered by: Individuals thoroughly familiar with the child's behaviour 
Method of Assessment: Direct observation 
Population for whom intended: Deaf, blind and multi-handicapped 
children functioning below the 6 or 7 year level 
Standardized: No 
Components: 
(i) Motor development 
(ii) Perceptual abilities 
(iii) Daily living skills 
(iv) Language development 
(v) Socialization 
Within each area there are subscales made up of sequential steps 
describing developmental milestones. 


(E) Name: Cain-Levine Social Competency Scale 
Authors: Leo F. Cain, Samuel Levine, Freeman F. Elzey 
Administered by: Interviewer with thorough knowledge of test items and 
strategies 
Method of Assessment: Interview of a person familiar with the child's 
behaviour 
Population for whom intended: Trainable mentally retarded children, 
age 5-13 
Standardized: Yes 
Components: 44 items divided into 4 subscales: 
(i) Self-help 
(ii) Initiative 
(iii) Social skills 
(iv) Communication 

(F) Name: Behavioural Characteristics Progression (1973) 
Administered by: Special education teacher, team; child care workers 
Method of Assessment: Direct observation 
Population for whom intended: Mentally and behaviourally 
exceptional children 
Standardized: No 
Components: The BCP consists of 2400 observable traits, referred 
to as behavioural characteristics, grouped into categories of 
behaviours called behaviour strands. 


(G) Name: Camelot Behavioural Checklist 
Author: Ray W. Foster 
Administered by: Classroom teacher, paraprofessional 
Method of Assessment: Person familiar with child fills out checklist 
from memory; or, direct observation 
Population for whom intended: Trainable youth 
Standardized: Yes 
Components: 399 behavioural descriptions grouped in 10 domains, 
arranged in order of difficulty: 
(i) Self-help 
(ii) Physical development 
(iii) Home duties 
(iv) Vocational behaviours 
(v) Economic behaviour 
(vi) Independent travel 
(vii) Numerical skills 
(viii) Communication skills 
(ix) Social behaviour 
(x) Responsibility 

(H) Name: The Devereux Child Behaviour Rating Scale 
Authors: George Spivack and Jules Spotts 
Administered by: Parent, nurse, child care worker, or other parent 
surrogate 
Method of Assessment: Observation 
Population for whom intended: Atypical children, ages 8-12 
Standardized: No 
Testing time: 10-20 minutes 
Components: There are 17 behaviour factors measured by the test. 
The first ten are labelled behaviour competence factors and the last 
seven are subsumed under the label "behaviour control problems". 

(I) Name: Devereux Elementary School Behaviour Rating Scale 
Authors: George Spivack and Marshall Swift 
Administered by: Teacher 
Standardized: No 
Method of Assessment: Observation 
Testing time: 10 minutes 
Population for whom intended: Elementary-aged children with problem 
behaviours 
Components: There are 44 questions which fall into the following 
behaviour factors: 
(i) Classroom disturbance 
(ii) Impatience 
(iii) Disrespect-defiance 
(iv) External blame 
(v) Achievement anxiety 
(vi) External reliance 
(vii) Comprehension 
(viii) Inattentive-withdrawn 
(ix) Irrelevant responsiveness 
(x) Creative initiative 
(xi) Needs closeness to teacher 
There are 3 additional items which did not fall into the above 
categories. They are: (a) Unable to change, (b) Quits, (c) Slow work. 

(J) Name: Devereux Adolescent Behaviour Rating Scale 
Authors: George Spivack, Peter E. Haimes, Jules Spotts 
Administered by: Clinicians, rehabilitation counselors, nurses, 
research investigators, and parent 
Method of Assessment: Observation 
Standardized: No 
Population for whom intended: Adolescents 
Components: Items which fall into 12 behaviour factors, three rational 
clusters and 11 additional items. 

(K) Name: Fairview Behaviour Evaluation Battery, 1974 
Authors: Robert T. Rose, Alan Boroskin, James S. Giampiccolo 
Administered by: Teacher, parent, caretaker, ward personnel 
Method of Assessment: The observer circles the number 
preceding the statement which best describes the individual's 
typical behaviour, not past performance or inferred potential. 
Population for whom intended: Designed for mildly, moderately, 
severely and profoundly retarded individuals 
Standardized: Yes 
Components: The Fairview Behaviour Evaluation Battery consists 
of five scales: 
(i) The Fairview Developmental Scale 
(ii) The Fairview Self-Help Scale 
(iii) The Fairview Social Skills Scale 
(iv) The Fairview Language Evaluation Scale 
(v) The Fairview Problem Behaviour Record 

(L) Name: Learning Accomplishment Profile for the Young Child (1974) 
Author: Edited by Anne R. Sanford 
Administered by: Teachers 
Standardized: No 
Method of Assessment: Child is tested on item; also, direct 
observation 
Population for whom intended: Handicapped (at least trainable) 
Components: A manual describing the LAP and a recording 
booklet. The LAP is designed to provide the teacher of the young 
child with a simple, criterion-referenced record of the 
child's existing skills. 


(M) Name: Pennsylvania Training Model, Individual Assessment Guide 
Authors: Ellen Somerton and Keith Turner 
Administered by: Classroom teacher 
Standardized: No 
Population for whom intended: Multiple-handicapped public school 
Components: 
(i) ... areas). 
(ii) Competency checklists (specific assessment within areas). 
(iii) Individual Prescriptive Planning Sheet (detailed analysis of 

(N) 
Method of Assessment: Individual testing 
Population for whom intended: Preschool-age children 
Components: 
(i) Language and speech 
(ii) Cognitive skills 
(iii) Self-care skills 
(iv) Social skills 
(v) Gross motor skills 
(vi) Fine motor skills 

(O) Name: The Preschool Profile 
Authors: Linda L. Lynch & Mary Ruth O'Connor, revised by Jill 
Colleen Gallagher 
Administered by: Preschool teacher 
Standardized: No 
Method of Assessment: Observation and individual testing 
Population for whom intended: Preschool children 
Components: 
(i) Gross motor skills 
(ii) Fine motor skills 
(iii) Pre-academic skills 
(iv) Self-help skills 
(v) Music, art, and story skills 
(vi) Social and play skills 
(vii) Understanding language (receptive language skills) 
(viii) Oral language (expressive language skills) 


(P) Name: MEMPHIS Instruments for Individual Programme Planning and 
Evaluation (Comprehensive Development Scale) 
Authors: Alton D. Quick, Thomas L. Little, A.N. Campbell 
Administered by: Classroom teacher 
Standardized: No 
Method of Assessment: Direct observation 
Population for whom intended: Children with a developmental age 
between 3 months and 5 years 
Components: Developmental evaluation of 
(i) Personal-social skills 
(ii) Gross motor skills 
(iii) Fine motor skills 
(iv) Language skills 
(v) Perceptuo-cognitive skills 

(Q) Name: Student Progress Record 
Administered by: Teacher 
Standardized: No 
Method of Assessment: Direct assessment 
Population for whom intended: Trainable mentally retarded 
Components: 
(i) Social skills 
(ii) Receptive language 
(iii) Expressive language 
(iv) Reading 
(v) Writing 
(vi) Number concepts 
(vii) Money 
(viii) Time 
(ix) Eating 
(x) Dressing 
(xi) Personal hygiene 
(xii) Motor skills 
(xiii) Physical fitness 

(R) Name: TARC (Topeka Association for Retarded Citizens) 
Assessment System 
Authors: Wayne Sailor & Bonnie Jean Mix 
Administered by: Classroom teacher 
Standardized: Yes 
Method of Assessment: Observation 
Population for whom intended: Handicapped 
Components: 
(i) Assessment in areas of skill development 
(ii) Deriving instructional objectives 
(iii) Profiling 
(iv) Curriculum selection (involves a computer retrieval system 
to match skill deficits with existing curricula). 


(S) Name: TMR Performance Profile for the Severely and Moderately 
Retarded (1970) 
Authors: A.J. Dinola, B.P. Kaminsky & A.E. Sternfeld 
Administered by: Teacher/Paraprofessional 
Standardized: No 
Method of Assessment: Observation, memory 
Population for whom intended: TMR 
Components: 
(i) Social behaviour 
(ii) Self-care 
(iii) Communication 
(iv) Basic knowledge 
(v) Practical knowledge 
(vi) Body usage 

(T) Name: UPAS (Uniform Performance Assessment System) 
Editor: Margaret Bendersky 
Administered by: Teacher 
Method of Assessment: Test/observation procedure 
Standardized: No 
Components: 
(i) Pre-academic 
(ii) Communication 
(iii) Social/Self-help 
(iv) Gross-motor 


CONCLUSION 

The process of assessment for handicapped children in the classroom can 
often be made easier by using commercially available assessment 
instruments. The teacher can broaden the child's assessment considerably 
by using criterion-referenced tests. Only a few teachers have the background or 
training necessary to design assessment devices. They can, however, 
construct instruments to assess performance in areas with relatively few 
component parts. They can confidently prescribe instructional programmes 
through a systematic use of both commercially available criterion-referenced 
and teacher-made instruments. The data from direct observation 
recording methods, used in conjunction with criterion-referenced assessment 
data, further extend the teacher's ability to do so effectively and efficiently. 
The most obvious strength of criterion-referenced assessment is the direct 
linkage between evaluation and prescription. The major limitation of 
criterion-referenced assessment is that it is a totally task-oriented device. 
It is used solely to evaluate a student's mastery of specific tasks and, in 
doing so, helps formulate a process for instruction. 

While assessing handicapped children by criterion-referenced tests or 
direct observation, the teacher is responsible for writing clinically and 
instructionally relevant assessment reports. Assessment is hardly useful 
for the development of a systematic instructional programme until the data are 
converted into a clearly written report. The report should indicate not only 
the performance but also the areas of instructional need. In other words, 
the educational clinical report should clearly highlight the handicapped child's 
proficiencies and deficiencies. 


Use of Criterion Referenced Tests in 
Curriculum Evaluation 


D. Brahadeeswaran 
K. Ramachandrachar 


ABSTRACT 


Criterion referenced tests provide the kind of test score 
information needed by teachers and curriculum evaluators to make 
a variety of individual and programmatic decisions arising in 
objective-based curricular programmes. The domain sampling technique 
is useful to select a representative sample of the objectives of the 
curriculum. If objectives are written in very specific operational 
terms to describe a domain, and if items are then written to 
sample the behaviour in this domain, then this would fit the 
description of criterion referenced tests. The various steps 
involved in developing and using criterion referenced tests for 
evaluating the curriculum of an individual subject included in a 
course of study have been described in this paper. 


1. INTRODUCTION 

The best measure of curriculum effectiveness is the percentage of students 
achieving the objectives of the curriculum. 

Criterion Referenced Tests (CRTs) are more suited for assessing 
curriculum effectiveness than Norm-referenced tests. This paper describes 
the methodology to be followed in developing and using criterion-referenced 
tests for evaluating the curriculum of an individual subject (like Physics or 
Mathematics) included in a course of study. 


2.0 CURRICULUM EVALUATION 

2.1 Concept of Curriculum Evaluation: 

Curriculum evaluation can be defined as the collection and provision 
of evidence on the basis of which decisions can be taken about the 
effectiveness and educational value of curricula. 

Cronbach (1963) solidified an emerging direction for curriculum evaluation 
by advocating course improvement as its purpose. 

Curriculum evaluation is essentially concerned with judging the 
effectiveness of curricula through processes of measurement or valuing or 
a combination of the two. 

2.2 Mapping Sentence Definition of Curriculum Evaluation: 

Lewy (1985, p. 198 and 1977, p. 30) has proposed a mapping sentence 
definition of curriculum evaluation which serves as a classification scheme 
of curriculum evaluation studies. An adapted version of this is presented in 
Figure 1. 


The mapping sentence contains three facets: the stage of curriculum 
development; the component of the curriculum or the entity being evaluated; 
and the type of decision situation. Combining all three aspects, it is 
possible to describe curriculum evaluation in the form of a mapping sentence. 


The overall definition presented in the mapping sentence in Figure 1 
contains a variety of evaluation activities; their totality makes up the more 
general concept "curriculum evaluation". This definition suggests that 
evaluation is the provision of information for the sake of facilitating decision 
making at various stages of curriculum development. This information 
may pertain to the programme as a complete entity or only to some of its 
components. 

The mapping sentence in its totality constitutes an overall inventory of 
decision situations which one may encounter during the process of curriculum 
development. Each such decision situation may require the conducting of 
a short-duration focused evaluation study. 


Figure 1. Mapping Sentence Definition of Curriculum Evaluation 

Evaluation is the provision of information at the 

A : Stages 
    determination of aims, 
    planning, 
    tryout, 
    field trial, 
    implementation, and 
    quality control 

stage of programme development, concerning the 

B : Entity 
    course content, 
    instructional resources and study material, 
    teachers' guide, 
    methodological approaches, 
    whole curriculum 

for the sake of making decisions about 

C : Decision Situation 
    selecting, 
    modifying, 
    qualifying the use of 

elements of the programme. 

Adapted from Lewy (1985 & 1977) 


One may define a particular substudy in the process of curriculum 
evaluation by selecting a single line from the three facets appearing in the 
mapping sentence. 

The totality of all substudies concerned with a particular programme 
constitutes its evaluation. 

The mapping sentence summarizes the variety of evaluation studies that 
may be performed during the life cycle of any new programme. 
Nevertheless, one should be cautious not to over-evaluate a programme. 
A great variety of evaluation foci have been mentioned here, not for the 
sake of encouraging the evaluator to utilize all of them in the context of 
dealing with a single programme, but to provide a broad inventory from which 
the activities most relevant to answering crucial questions may be selected. 

2.3 Models of Curriculum Evaluation 

Formalizing a complex process such as curriculum evaluation into a model 
is very helpful. The function of a model in evaluation is to provide a conceptual 
framework or a rationale for designing evaluation studies. 

Variations in curriculum evaluation models result from differences 
in the purpose of evaluations, the types of evaluations, the methodology 
used in the conduct of evaluation and the questions asked. 

As evaluation efforts sought curriculum improvement, researchers 
acknowledged the importance of process variables. Scriven (1967) 
distinguished between "formative" evaluation, focusing upon implementation 
processes, and "summative" evaluation, focusing upon outcomes. Stufflebeam 
(1969) described formative elements in terms of "context", "input", and 
"process", but his interpretation of "product" can be associated with the 
summative approach. Stake (1967) added complexity to the conception 
of evaluation by highlighting three major variables subject to both 
descriptive and judgemental portrayal: "antecedents", "transactions" and 
"outcomes". 

2.4 Curriculum Evaluation at the Macro and Micro levels 

Curriculum evaluation can be carried out at two levels: Macro-level and 
Micro-level. The term 'Macro-level evaluation' refers to the evaluation of the 
overall effectiveness of the whole programme of different subjects prescribed 
as curriculum for the entire course. 'Micro-level evaluation' pertains to the 
detailed evaluation of the curriculum of an individual subject (say Chemistry 
or Mathematics) included in the programme of study for a particular course 
(say VIII standard). 

The main purpose in the 'Micro-level evaluation' of the curriculum of 
an individual subject is to determine the extent to which the objectives of the 
subject are achieved by the students. 

Many researchers (Davis et al., 1974 and Orlosky and Smith, 1978) have 
opined that the best measure of curriculum effectiveness is the percentage of 
students achieving the objectives of the curriculum. In order to identify the 
areas of weakness in a curriculum, the performance of students has to 
be analysed objective-wise. Such an analysis will provide valuable inputs to 
facilitate (i) learning of various topics (content areas) and (ii) attainment of 
the various abilities aimed at by the curriculum. 
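
The objective-wise analysis described above can be illustrated with a short Python sketch. Everything specific in it is an assumption made for the example: the student and objective labels, the scores, the mastery level of 0.8 and the 50 per cent threshold used to flag a weak objective. None of these values is prescribed by the paper.

```python
def percent_achieving(results, mastery_level):
    """Percentage of students reaching the mastery level on each objective.

    results maps each student to {objective: proportion of that
    objective's items answered correctly}.
    """
    objectives = sorted({obj for scores in results.values() for obj in scores})
    n_students = len(results)
    return {
        obj: 100.0 * sum(1 for scores in results.values()
                         if scores.get(obj, 0.0) >= mastery_level) / n_students
        for obj in objectives
    }


# Hypothetical objective-wise scores for four students.
results = {
    "S1": {"O1": 1.0, "O2": 0.6, "O3": 0.8},
    "S2": {"O1": 0.8, "O2": 0.4, "O3": 1.0},
    "S3": {"O1": 0.6, "O2": 0.8, "O3": 0.8},
    "S4": {"O1": 1.0, "O2": 0.2, "O3": 0.6},
}

pct = percent_achieving(results, mastery_level=0.8)
# Objectives achieved by fewer than half the students are flagged as weak areas.
weak = [obj for obj, p in pct.items() if p < 50.0]
print(pct)
print(weak)
```

In this made-up data, objective O2 is achieved by only one student in four, so it would be the first place to look for a weakness in the curriculum or its teaching.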


3.0 NEED FOR USING CRTs IN CURRICULUM EVALUATION 


One of the most commonly used measures of curriculum 
effectiveness is the student test score. Stake (1979) argues that achievement 
test means, based on group scores or other region wide summary scores 
do not provide valid measures for diagnosing curriculum weaknesses nor for 
initiating curriculum changes. 

According to Cronbach (1963), "to agglomerate many types of post course 
performance into a single score is a mistake, because failure to achieve one 
objective is masked by success in another direction. Moreover, since a 
composite score embodies (and usually conceals) judgements about the 
importance of the various outcomes, only a report that treats the outcomes 
separately will be useful to curriculum evaluators". 

Criterion Referenced Tests provide the kind of test score information 
needed to make a variety of individual and programmatic decisions 
arising in objectives-based instructional programmes. The more traditional 
norm-referenced tests are considered less than ideal for providing the desired 
kind of test score information. 
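
Cronbach's point about masking can be seen in a small numerical sketch. The two student profiles, the ten items per objective and the cutoff of 6 are hypothetical values chosen only for illustration.

```python
def unmet_objectives(profile, cutoff):
    """Return the objectives on which the score falls below the cutoff."""
    return sorted(obj for obj, score in profile.items() if score < cutoff)


# Hypothetical scores out of 10 items on each of two objectives.
student_a = {"O1": 10, "O2": 4}
student_b = {"O1": 7, "O2": 7}

# The composite scores are identical ...
print(sum(student_a.values()), sum(student_b.values()))

# ... but only the separate report shows that student A has not achieved O2.
print(unmet_objectives(student_a, cutoff=6))
print(unmet_objectives(student_b, cutoff=6))
```

Both students total 14 out of 20, yet their needs differ, which is why an objective-by-objective report is more useful to the curriculum evaluator than a single composite.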


3.1 Shortcomings of Norm-referenced Tests for Curriculum Evaluation 


According to Popham (1981, p. 26), "a norm-referenced test is used to 
ascertain an individual's status with respect to the performance of other 
individuals on that test". The emphasis of norm-referenced tests is on 
relative interpretation, that is, the interpretation of an examinee's performance 
in relation to the performance of the examinees in the normative sample. For 
criterion-referenced tests, by contrast, interpretations are made in absolute 
terms. 

For several obvious reasons, researchers like Hambleton and Eignor (1978) 
consider norm-referenced tests unsuitable for the measurement of 
curriculum effectiveness, because what is at issue is not whether one 
student's test performance is better than another's but how well the student 
has mastered each of the objectives of the curriculum. 

Popham (1978) has identified three weaknesses of norm referenced tests 
for programme evaluation purposes. They are listed below. 

i) Since norm-referenced tests are often so general, they 
frequently fail to mesh satisfactorily with the curricular emphases 
of the programme being evaluated. When the match between test 
content and programme content is low, we have nothing of 
value. 

ii) As norm-referenced tests are very general, they do not provide 
specific cues for identifying the weak areas of the curriculum or 
for instructional amelioration. The typical diffuseness of 
norm-referenced tests renders them largely useless for such 
improvement guidance. 

iii) The technical item-production and item-refinement procedures 
employed in the development of norm-referenced tests tend to 
make such tests less sensitive to detecting instructional effects 
than their criterion-referenced counterparts. 

The purpose of norm-referenced tests is to compare an individual's 
performance to that of some reference group. Consequently, 
norm-referenced tests consist of test items that contribute most to maximising 
test score variability; those contributing low variability are eliminated. It is 
clear that items tapping concepts taught successfully by a great number of 
teachers will contribute little to test score variability (most students will 
answer the items correctly) and will be eliminated, while the items measuring 
pure reasoning ability will have greater variability and will be retained. As a 
result of this process, the test begins to look less like an achievement test and 
more like an aptitude test. The process of item selection puts a distance 
between the curriculum of the educational programme and the tool used to 
evaluate it. The test would be sensitive to the aptitude of the individuals rather 
than the effectiveness of the instruction. If an instrument is to be sensitive to 
the learning process, its content must be very carefully matched to that of the 
programme. It is being said more and more (Hambleton and Eignor, 1978, 
p. 15) that norm-referenced tests function like IQ tests. 

Hence it is essential to use criterion-referenced tests for curriculum 
evaluation. 
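
The effect of selecting items for variability can be sketched numerically. For an item scored right or wrong, the score variance is p(1 - p), where p is the proportion of students answering correctly, so an item that nearly everyone answers correctly contributes very little variance. The item labels, the proportions and the retention threshold of 0.10 below are hypothetical, and the selection rule is a deliberate simplification: actual norm-referenced test construction also relies on item discrimination indices.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by proportion p."""
    return p * (1.0 - p)


def retained_items(difficulties, min_variance):
    """Keep only the items whose score variance reaches the threshold."""
    return [name for name, p in difficulties.items()
            if item_variance(p) >= min_variance]


# Hypothetical proportions of students answering each item correctly.
difficulties = {
    "concept taught well by most teachers": 0.95,
    "item of moderate difficulty": 0.50,
    "general reasoning item": 0.60,
}

for name, p in difficulties.items():
    print(name, round(item_variance(p), 4))

print(retained_items(difficulties, min_variance=0.10))
```

Under this rule the well-taught concept is the item that is dropped, which is the distancing of test from curriculum that the paragraph describes.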


4.0 DEFINITION OF CRITERION-REFERENCED TESTS 


A recent content analysis (Gray, 1978) of 57 descriptions of criterion 
referencing revealed that it was not unusual for different authors to use the 
term differently. Nitko (1983, p. 446) has observed that criterion-referencing 
is in a state of development and that more or less standard ways of 
criterion-referencing have not become firmly established. 

Popham's (1975, p. 130) definition of a criterion-referenced test, given below, 
has been preferred by many researchers. 

"A criterion-referenced test is used to ascertain an individual's status with 
respect to a well defined behaviour domain". 

The term "criterion" in the phrase "criterion-referenced test" refers to 
a behaviour domain. 

One of the major sources of confusion that prevailed during the last 20 
years in the definition of the criterion-referenced test is over the word 'criterion'. 

Popham (1981, p. 27) has clarified this confusion by distinguishing 
between two conceptions of criterion, viz. the criterion-as-a-level conception 
and the criterion-as-a-desired-behaviour conception. When the term 'criterion' 
is used to signify a desired level of proficiency, it reflects criterion-as-a-level. 
When the term 'criterion' is used to signify the target behaviours themselves, 
it reflects criterion-as-a-desired-behaviour. As observed by Popham (1981, 
p. 28), interpreting criterion as a level of examinee proficiency yields almost 
no dividends over traditional testing practices. In fact, by using that conception 
of criterion, one could magically transform any norm-referenced test into a 
criterion-referenced test merely by setting a specific proficiency level for the 
test. Criterion-referenced tests will provide substantial educational pay-off 
only if they provide a precise description of an examinee's status with 
respect to a clearly delimited domain of behaviours. 


4.1 The Concept of Behaviour Domain

The dictionary meaning of the term ‘domain’ is scope, field or province of 
thought or action. 

When referring to “domains” in connection with criterion-referenced
measurement, some educators may confuse this more recent application
with the earlier context of the taxonomy of educational objectives. In
criterion-referenced testing the term ‘domain’ refers to a much smaller class of
behaviours (Popham, 1975, p. 131).

Hively et al. (1973) have stated that the concept of domain includes both 
(a) specific content area as well as (b) behaviours associated with this content. 


A domain is well-defined when both the person(s) developing the test and 
the person(s) using the test are clear about which categories of performance 
(or which kinds of tasks) are and which are not potential test items. Since the 
basic idea of criterion-referencing is to generalize from the few items that 
happen to be on the test to the broader domain of performance from which the 
test items were sampled, a well-defined domain is a necessary condition for
criterion referencing.


5.0 STEPS INVOLVED IN THE DEVELOPMENT AND USE OF CRTs 
FOR CURRICULUM EVALUATION 

What is relevant in curriculum evaluation is how well an individual is 
performing with respect to each of the objectives of the curriculum. When the 
curriculum to be evaluated is vast, domain sampling procedure is used to arrive 
at a representative sample of the curriculum in terms of objectives. 
Criterion-referenced tests provide the kind of test score information needed
to assess the extent of achievement of various objectives of the curriculum. The
steps involved in the development and use of CRTs for curriculum evaluation
are depicted in a flow diagram (Figure 2).


1. Selecting a representative sample of the objectives of the curriculum.
2. Specifying the domain of tasks.
3. Establishing cut-off score for mastery decisions.
4. Preparation of test items.
5. Judging content validity.
6. Administration of the CRTs.
7. Analysing the data and interpreting the findings.

Fig. 2. Flow diagram of the various steps to be followed in the development
and use of CRTs for curriculum evaluation.

5.1 STEP-1: SELECTION OF A REPRESENTATIVE SAMPLE OF 
THE OBJECTIVES OF THE CURRICULUM. 

If the curriculum document does not contain the specific objectives 
of the curriculum, then the investigator has to prepare a comprehensive 
list of objectives based on the content details given in the curriculum 
document. The validity of the comprehensive list of objectives has to be 
established by collecting Jury opinion. 

Testing is always a matter of sampling objectives, since time and resources
are always limited.

For selecting a representative sample of objectives from the
comprehensive list of objectives of the curriculum, the ‘DOMAIN SAMPLING’
technique is used. The guidelines provided by Popham (1978) and Hively et
al. (1973) are very useful in the domain sampling reported here.

Hively's (1973) concept of domain includes both (a) specific content area 
and (b) behaviours associated with this content. By carefully considering both 
these aspects of various domains in the curriculum and also the instruction
time (Popham, 1978) allotted for each topic in the curriculum, a representative
sample of the objectives of the curriculum can be selected from the
comprehensive list of objectives. While selecting a sample of objectives from
each topic, it must be ensured that the highest ability aimed at by the topic is
represented in the sample of objectives selected. This is because in a
given topic the objective representing the highest ability is the ultimate 
objective aimed by the curriculum and the other objectives in the lower 
categories serve as enabling objectives only. 
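
The selection rule described above can be sketched in code. The following Python sketch is illustrative only: the topic names, instruction hours, ability levels and the proportional-allocation rule are assumptions made for the example and are not taken from Popham (1978) or Hively et al. (1973). It allots objectives to each topic roughly in proportion to instruction time and always retains the objective representing the highest ability in that topic.

```python
import random

# Hypothetical ordering of ability levels, lowest to highest (an assumption).
ABILITY_RANK = {"knowledge": 1, "understanding": 2, "application": 3}


def sample_objectives(objectives, hours, total_sample, seed=0):
    """Pick objectives per topic in proportion to instruction time.

    objectives: dict topic -> list of (objective_id, ability) tuples
    hours:      dict topic -> instruction time allotted to the topic
    The highest-ability objective of every topic is always retained.
    """
    rng = random.Random(seed)
    total_hours = sum(hours.values())
    sample = {}
    for topic, pool in objectives.items():
        # Proportional share, but never fewer than one objective per topic.
        quota = max(1, round(total_sample * hours[topic] / total_hours))
        quota = min(quota, len(pool))
        top = max(pool, key=lambda obj: ABILITY_RANK[obj[1]])
        rest = [obj for obj in pool if obj != top]
        sample[topic] = [top] + rng.sample(rest, quota - 1)
    return sample


# Hypothetical curriculum data used only to exercise the function.
objectives = {
    "atomic structure": [("A1", "knowledge"), ("A2", "understanding"),
                         ("A3", "application")],
    "electrochemistry": [("E1", "knowledge"), ("E2", "application")],
}
hours = {"atomic structure": 6, "electrochemistry": 2}
chosen = sample_objectives(objectives, hours, total_sample=4)
```

With these made-up figures the topic with three times the instruction time contributes three objectives and the other topic contributes one, namely its highest-ability objective.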


5.2 STEP-2: SPECIFYING THE DOMAIN OF TASKS

Nitko (1983, p. 452) has stated that verbal statements of stimuli and
responses in the domain can be used as the basis for specifying the class
or domain of tasks. The Curriculum Embedded Tests of IPI Mathematics (Cox and
Boston, 1967), Popham and Husek's Criterion Referenced Tests (1969) and
Harris and Stewart's Criterion Referenced Tests (1971) have used
behavioural objectives to delineate the domain of tasks. Popham (1978) has
suggested that behavioural objectives, as usually stated, are not sufficient to
delineate the domain properly because they are too vague. To overcome this
difficulty, Hively et al. (1973, p. 13) have suggested that the objectives of the
curriculum are to be operationally defined. Each of the objectives should
clearly state the characteristics of the stimuli and responses in the domain
(as given in the examples listed below). This will facilitate construction of
test items matching the specifications of the domain.

Examples:
i) Determine the atomic number and mass number of a product
formed from a given isotope as a result of emission of a specified
number of alpha particles.


ii) Identify the anode, cathode and electrolyte to be used for a given 
electroplating process. 
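
To illustrate how an operationally defined objective fixes both the stimuli and the acceptable responses, example (i) can be written as a small item generator. This is a sketch under stated assumptions: the isotope pool, the limit of one to three alpha particles and the function names are hypothetical. The only fact relied upon is that each alpha emission lowers the atomic number by 2 and the mass number by 4.

```python
import random


def alpha_decay_product(atomic_number, mass_number, n_alpha):
    """Each alpha particle removes 2 protons and 4 nucleons."""
    return atomic_number - 2 * n_alpha, mass_number - 4 * n_alpha


def generate_item(isotopes, rng):
    """Draw one item from the domain defined by the stimulus and response limits."""
    symbol, z, a = rng.choice(isotopes)
    n_alpha = rng.randint(1, 3)  # assumed content limit: 1 to 3 alpha particles
    stem = (f"An isotope {symbol}-{a} (atomic number {z}) emits {n_alpha} "
            f"alpha particle(s). State the atomic number and mass number "
            f"of the product.")
    return stem, alpha_decay_product(z, a, n_alpha)


# Hypothetical stimulus pool; any alpha-emitting isotopes would serve.
ISOTOPES = [("U", 92, 238), ("Th", 90, 232), ("Ra", 88, 226)]
stem, key = generate_item(ISOTOPES, random.Random(1))
```

Every item produced this way belongs to the same domain, so a student's score on a few such items can be generalized to the whole class of tasks.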


5.3 STEP-3: ESTABLISHING CUT-OFF SCORE FOR MASTERY 
DECISIONS 

As the CRTs will be used to determine the percentage of students who
have mastered each of the objectives of the curriculum, it is necessary to
establish a cut-off score or passing standard. This is also called the ‘Criterion
Level’. As observed by Nitko (1983, p. 460), no entirely satisfactory procedure
exists for establishing a cut-off score to which all parties will agree, because
there are numerous factors to consider.

Fixing a uniform criterion level for each and every domain takes no account
of the relative importance of these different domains. According to Garvin
(1971), if on the basis of logical analysis of the subject matter and the extant
instructional system, the knowledge and skills are seen as fundamental or
pre-requisite to future learning, then a high proficiency level is required. A
lower cut-off score can be tolerated when the material is not seen as
completing a necessary link in the development of some higher complex
concept or skill.

Following these guidelines provided by Garvin (1971), the cut-off scores
for the CRTs can be established. For example, in the case of CRTs with four
items in each, a cut-off score of 3 (75%) may be fixed as the criterion level for
classification of students as masters. That is, students whose score is equal
to or above 3 would be classified as masters.

For CRTs with six items in each, a cut-off score of 4 (66.67%) may be
fixed as the criterion level for classification of students as masters.
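
The two cut-off examples above amount to a simple rule, sketched below in Python. The function names and the student scores are hypothetical. The cut-offs (3 of 4 items, 4 of 6 items) are the ones quoted in the text.

```python
def cutoff_percent(cutoff, n_items):
    """Express a raw cut-off score as a percentage of the test length."""
    return round(100 * cutoff / n_items, 2)


def classify(scores, cutoff):
    """Label each student a master if the score is equal to or above the cut-off."""
    return {student: ("master" if score >= cutoff else "non-master")
            for student, score in scores.items()}


# Hypothetical scores on a four-item CRT with a cut-off score of 3.
four_item_scores = {"S1": 4, "S2": 3, "S3": 2}
labels = classify(four_item_scores, cutoff=3)
```

Here S1 and S2 are classified as masters and S3 as a non-master. A cut-off of 3 on four items corresponds to 75%, and a cut-off of 4 on six items to 66.67%.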


5.4 STEP-4 : PREPARATION OF TEST ITEMS 


Items for each of the CRTs have to be prepared matching the
specifications laid down in the operationally defined objectives. All the items in a
particular CRT must be homogeneous in the sense that they test the same
segment of content and behavioural ability represented by a single objective.
To begin with, a larger pool of items, i.e. at least 25 percent more
items than are strictly necessary, must be prepared.


5.5 STEP-5 : JUDGING CONTENT VALIDITY 

The content validity of the items in each CRT is determined by jury
opinion. A panel consisting of five or seven teachers can serve as members
of the jury. A copy of the items prepared for each of the CRTs and the
respective domain specifications is given to each member of the jury. They
are requested to look for correspondence between what they judge each item
to measure and the domain it purports to measure, and to rate each item's match
with its domain specification on a scale of +1 (perfect match), 0 (undecided)
and -1 (mismatch). Final selection of items is then based on the consensus
judgement of the judges. Based on the feedback obtained from them,
the necessary modifications are made in the test items.
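
The jury ratings can be summarised item by item before the consensus decision is taken. The sketch below is one possible way of doing so, not a procedure prescribed in this paper: the 80 percent agreement threshold, the item labels and the ratings are assumptions chosen for illustration.

```python
def summarise_ratings(ratings):
    """ratings: dict item_id -> list of judge ratings (+1, 0 or -1)."""
    summary = {}
    for item, votes in ratings.items():
        summary[item] = {
            "mean": sum(votes) / len(votes),           # average rating
            "agree": votes.count(1) / len(votes),      # share rating +1
        }
    return summary


def select_items(ratings, min_agree=0.8):
    """Keep items that at least `min_agree` of the judges rated +1."""
    summary = summarise_ratings(ratings)
    return [item for item, s in summary.items() if s["agree"] >= min_agree]


# Hypothetical ratings from a five-member jury.
jury = {
    "Q1": [1, 1, 1, 1, 1],
    "Q2": [1, 0, -1, 1, 0],
    "Q3": [1, 1, 1, 1, 0],
}
retained = select_items(jury)
```

Items falling below the chosen threshold (Q2 in this made-up case) are the ones to be revised or dropped in the light of the judges' feedback.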


5.6 STEP-6: ADMINISTERING THE CRTs 
All the CRTs pertaining to a subject should not be administered in a single
shot at the end of the year. It is desirable to administer the CRTs pertaining
to each of the units in a subject as soon as the particular unit of instruction is
completed. For this purpose a subject may be divided into convenient units
by clustering related topics.

First a PILOT STUDY has to be conducted by administering the test to a
small sample of students. Using the data of the pilot study, item analysis has
to be carried out. (For details refer to Ved Prakash's paper titled “Evaluating
quality of criterion referenced test items” in this book.) Based on the results
of item analysis, necessary modifications in the items constituting the Criterion
Referenced Tests have to be made.


The refined CRTs can then be administered to the entire sample of the study.


5.7 STEP-7: ANALYSING THE DATA & INTERPRETING THE FINDINGS 


To determine the extent to which the objectives of the curriculum are
achieved by the students, the performance of individual students in each
CRT is considered. In each CRT, students who have secured a score
equal to or above the cut-off point specified are classified as ‘masters’.


While analysing the data of a CRT, it is necessary to estimate the
reliability of mastery classification decisions. Subkoviak (1976) has proposed
a single test administration method of obtaining an estimate of the proportion
of students in a group that are consistently assigned to the same mastery
state. The single administration group coefficient of agreement estimate
proposed by Subkoviak can be computed as the estimate of reliability for
each CRT.
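
A rough computational sketch of such a single-administration agreement coefficient is given below. It follows the commonly described outline of the procedure under a binomial error model: each student's true proportion correct is estimated by regressing the observed proportion towards the group mean, the probability of reaching the cut-off on a test of the same length is computed, and the chance of the same classification on two such tests is averaged over students. The use of KR-21 as the reliability estimate, the function names and the scores are assumptions of this sketch. The details should be checked against Subkoviak (1976) before use.

```python
from math import comb
from statistics import mean, pvariance


def kr21(scores, n_items):
    """KR-21 reliability, clipped to the range 0 to 1 (assumed estimator)."""
    m, v = mean(scores), pvariance(scores)
    if v == 0:
        return 0.0
    r = (n_items / (n_items - 1)) * (1 - m * (n_items - m) / (n_items * v))
    return max(0.0, min(1.0, r))


def prob_mastery(p, n_items, cutoff):
    """Binomial probability of scoring at or above the cut-off."""
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(cutoff, n_items + 1))


def agreement_coefficient(scores, n_items, cutoff):
    """Estimated proportion of students classified consistently."""
    alpha, m = kr21(scores, n_items), mean(scores)
    total = 0.0
    for x in scores:
        # Regressed estimate of the student's true proportion correct.
        p = alpha * (x / n_items) + (1 - alpha) * (m / n_items)
        pm = prob_mastery(p, n_items, cutoff)
        total += pm**2 + (1 - pm)**2  # master twice or non-master twice
    return total / len(scores)


# Hypothetical scores on a four-item CRT with a cut-off score of 3.
scores = [4, 3, 2, 4, 1, 3, 4, 0, 2, 3]
coefficient = agreement_coefficient(scores, n_items=4, cutoff=3)
```

By construction the coefficient lies between 0.5 and 1, with higher values indicating more consistent mastery decisions.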

For evaluating the extent to which the various objectives of the curriculum
have been mastered by students, the following procedure is used. An
objective which has been mastered by a pre-specified percentage of students
(say 60, 70 or 80%) is considered a mastered objective. By applying this
criterion the curricular objectives which have not been mastered are
identified.

To get an insight into the nature of objectives which have not been 
mastered, these objectives have to be considered first in relation to the topic 
they represent and then in relation to the ability they represent. 
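
The procedure just described can be sketched as follows. The objective labels, topics, abilities and the 70 percent requirement are hypothetical values chosen for illustration. The logic simply computes the percentage of masters per objective and lists the objectives falling short, ordered first by topic and then by ability.

```python
def percent_masters(classifications):
    """classifications: dict objective -> list of booleans (True = master)."""
    return {obj: 100 * sum(flags) / len(flags)
            for obj, flags in classifications.items()}


def unmastered_objectives(classifications, meta, required_percent=70):
    """Return (topic, ability, objective, percent) rows below the requirement."""
    pct = percent_masters(classifications)
    rows = [(meta[obj]["topic"], meta[obj]["ability"], obj, pct[obj])
            for obj in pct if pct[obj] < required_percent]
    return sorted(rows)


# Hypothetical mastery classifications of five students on three objectives.
results = {
    "O1": [True, True, True, True, False],
    "O2": [True, False, False, True, False],
    "O3": [False, False, True, False, False],
}
meta = {
    "O1": {"topic": "atomic structure", "ability": "knowledge"},
    "O2": {"topic": "atomic structure", "ability": "application"},
    "O3": {"topic": "electrochemistry", "ability": "understanding"},
}
not_mastered = unmastered_objectives(results, meta)
```

In this made-up case O1 (80% masters) counts as a mastered objective, while O2 and O3 are reported for closer study by topic and ability.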


6.0 CONCLUDING REMARKS 


For objective-wise assessment of the effectiveness of a curriculum, criterion
referenced tests are highly useful. The more traditional norm referenced
tests are considered unsuitable for the above purpose, because what is at
issue is not whether one student's test performance is better than another's
but how well the student has mastered each of the objectives of the
curriculum. When the curriculum to be evaluated is vast, the domain sampling
technique can be used to select a representative sample of the curricular
objectives. The step-wise procedure to be followed in developing and using
criterion referenced tests for curriculum evaluation, described in this paper,
provides the practical ‘know-how’ to teachers. It is hoped that criterion
referenced tests will be increasingly used in future for evaluation of the
curricula of various courses.

Unresolved Issues in Criterion
Referenced Testing 


Hari Kesh Singh 
ABSTRACT 


Criterion referenced testing is a positive technique of
assessing educational outcomes destined to be attained after well
defined and predetermined pedagogical efforts. Many articles and
research papers have appeared on the subject, but no initial analysis has
been made in regard to the issues of criterion referenced testing
which have remained unresolved so far. In this analytical
paper, an attempt has been made to enumerate the various
unresolved issues of criterion referenced testing. The attempt is
not confined to highlighting the unresolved issues;
viable solutions, which remedy the issues, have also been
suggested.

The major unresolved issues which have been enlisted
after a thorough and rigorous analysis of the literature on criterion
referenced testing in general, and the Indian context of educational
measurement and evaluation in particular, are dichotomy,
non-linearity, overformalization, illogical homogeneity, unvalidatedness,
lack of replications, fallacious specifications, unjustified
equivalence of the synonyms of constructs and procedural
inconsistencies. Logically and psychologically tangible
alternatives have been proposed, beginning from philosophical
prudence of content specification to the validation of criteria and
evaluation of the internal potency of such measurement in tune with
the laid down objectives.


1.0 Educational research methodologists and semanticists have connoted
‘measurement’, ‘evaluation’ and ‘assessment’ with varied concepts. A general
consensus has emerged where it is considered that the process concerned with
‘quantification’ of any phenomenon is ‘measurement’ and the consequent
decision made on ‘quantified data’ is termed ‘evaluation’. Quantity is
measured and quality is evaluated. Thus measurement is supposed to be a
precondition of evaluation. Measurement precedes evaluation.
Criterion-referenced testing is measurement in its ‘process’ and evaluation in its
‘purpose’. The scope of criterion-referenced testing has been frequently
delimited to school contexts, but it does have the potential to be applied
to other human enterprises of behavioural modification.

1.1 The distinction between measurement and evaluation has been
clarified by Dubois, Alverson and Staley. They write, “As with any
assessment process, the evaluation of entering behaviour involves the
collection and evaluation of data. Psychologists working in the field of tests
and measurements use the term measurement to refer to the collection
portion of the process. According to Stevens (1951, p. 1)1, ‘in its broadest
sense, measurement is the assignment of numerals to objects or events
according to rules’ (1951, p. 1)2. We measure height and weight following
certain rules and then assign some numerical value to the measurements.
We do not assign numbers in all cases of measurement, especially when
using criterion-referenced measuring instruments. Here the symbols assigned
may be equivalent to + or -, since the measuring instrument sets a single
standard and the individual either meets or fails to meet the absolute standard
set by the objective. When evaluating data we go beyond the concept of
measurement and make a judgement about the measurements taken. The
judgements can be in terms of either a norm-referenced standard or a
criterion referenced standard” (1979, p. 152)3.

1.2 Many research methodologists admit that criterion-referenced tests
are more useful than others in facilitating student learning. Glaser
and Klaus mention that ‘tests designed to measure the achievement of specific
performance objectives are referred to as criterion-referenced tests. The
type of information obtained from a criterion-referenced measure is more
useful than information obtained from other types of measurement for
instructional purposes’ (1962)4.

1.3 Psychometricians believe that criterion-referenced testing possesses
the characteristics which are conducive to calling it an objective approach to
measurement and evaluation. A sophisticated analysis of the claims
of the advocates of criterion referenced testing reveals that
criterion-referenced testing, despite its purpose-oriented inability, does possess some
questionable procedural inconsistencies. These inconsistencies are
issues in themselves and have been raised in some form or the other by
expert psychometricians. The three major components of any learning
objective are Behaviours, Conditions and Standards, which are often
taken into consideration while constructing a criterion referenced test. The
procedural inconsistencies occur either in these domains or in the basic
assumption of adopting this approach to evaluation, as challenged by
holists who believe that learning cannot be assessed in terms of fragmented
behavioural components.

2.0 Before proposing the unresolved issues in criterion-referenced
testing, it seems pertinent to enumerate the positives of this sort of
testing, especially in educational premises, and the corresponding manifest
assumptions behind each positive perspective. The salient positive
perspectives are as follows.

2.1 Criterion-referenced testing is designed to measure performance
objectives in instructional situations.

2.2 Criterion-referenced testing is more useful than any other form of
testing.

2.3 Criterion-referenced testing can help the teacher to assess the feedback
and its appropriate use for bettering the instructional strategies.

2.4 Criterion-referenced testing can be employed in the forms of pretest,
formative and summative evaluation.

2.5 Criterion-referenced testing requires categorical and conceptual clarity
in regard to making the statement of the performance objectives.


3.0 These positive perspectives provide prospects for criterion-referenced
tests, but there are certain inherent issues pertaining to criterion-referenced
testing which have remained unresolved. These unresolved issues belong to
the element of justification of the criterion-referenced tests, statement of
the performance objectives, test construction, ensuring the characterised
parameters of the tests, internal potency of the test with reference to outcome,
and nature of feedback available from the test. However, these issues have been
put here in a quite symptomatic manner. Each issue which seems unresolved
further requires description from different angles of analysis. All psychometric
tests in general and criterion-referenced tests in particular are criticised
because these tests fail to satisfy philosophical queries. The basic purpose
of any measurement or evaluation should be its utility for the broader context of
human society, as far as possible in real life situations, with satisfactory
reliability and validity. Now the determinants of the goodness of a test are
explicitly clear. Any inconsistency prevalent in criterion-referenced testing
may be termed an ‘unresolved issue’ which ought to and can be remedied,
provided these unresolved issues are attacked with prudence and tackled
methodically. The complex but remediable unresolved issues in
criterion-referenced testing are:

3.1 Criterion-referenced testing generally does not aim at interrelating
the school-based objectives with societal goals. Its nature is
dichotomous in regard to educational objectives and educational
goals.

3.2 Criterion-referenced testing does not ensure logical linearity
among students', schools' and society's domains of educational
objectives. Thus non-linearity occurs in such testing.

3.3 Overformalization of the conditions of learning further aggravates the
problem of generalizability of the outcomes of criterion-referenced
testing.

3.4 Illogical homogeneity is presumed even for heterogeneous
groups of learners without sufficient pretest scrutiny for homogeneity.

3.5 A criterion once judged by one expert or by a group of experts is often
taken for granted as valid without processing it for validation.

3.6 Replicational testings are not reported so as to ensure that the
process based administrational fallacies (errors) may be minimised.

3.7 In criterion-referenced testing the linearity of stated objective,
corresponding behaviour and appropriate item is not tested through
any psychometric techniques.

3.8 The limits and typology of specifications are still undetermined.

3.9 The ‘constructs’ used sometimes synonymously with criterion
referenced testing, such as domain referenced, cloud referenced,
mastery tests, competency tests, basic skills tests etc., have not
been properly operationalized and semantically justified.

3.10 The inconsistencies and consistencies at three different stages
of criterion referenced testing are not psychometrically and statistically
ascertained. At the stage of content specification, four steps are
taken into consideration, which are: Description, Sample directions
and test item, Content limits, and Response limits. These steps have
been proposed by Hambleton (1978, p. 1-47)5. At the test development
level 12 steps have been proposed by Hambleton and others. These
steps are:


3.10.1 Preliminary considerations
3.10.2 Review of competency statements
3.10.3 Item writing
3.10.4 Assessment of content validity
3.10.5 Revisions to test items
3.10.6 Field test administration
3.10.7 Revisions to test items
3.10.8 Test assembly
3.10.9 Selection of a standard
3.10.10 Pilot test administration
3.10.11 Preparation of manuals
3.10.12 Additional technical data collection


Here the inconsistencies, if any, stemming at any stage are not checked
and curbed accordingly because of the lack of any alternate but
equivalent testing device for the same specified objectives.

4.0 The issues enumerated here have been pointed out by different
authorities on measurement and evaluation. It will not be beyond the scope of this
paper if the views of some pioneer psychometricians are cited here to
corroborate our observations about the unresolved issues of criterion
referenced testing.

4.1 Du Bois, Alverson and Staley refer to Tyler's observation and they are
of the opinion that ‘most teachers, and most schools for that matter, do not
attempt a feat such as developing a complete and explicit statement of their
educational philosophy, yet it is an important step in the derivation of valid
objectives’ (1979, p. 206-207).

4.2 Tyler provides a clue to resolve the aforesaid issue and writes, “For a
statement of philosophy to serve most helpfully as a set of standards or a
screen in selecting objectives, it needs to be stated clearly and for the main
points the implications for educational objectives may need to be spelled out.
Such a clear and analytical statement can then be used by examining every
proposed objective and noting whether the objective is in harmony with one or
more main points in the philosophy, is in opposition or is unrelated to any of
these points. Those in harmony with the philosophy will be identified as
important objectives” (1950, p. 24). But evidently no literature on testing
provides any technique of determining the ‘harmony index’. If this ‘harmony
index’ is attempted and resolved, much perplexing ‘subjectivity’ can be
countered at this beginning stage of test development.


4.3 Illogicality of the behavioural statement of the educational objectives
has been raised by Ebel when he states, ‘To consider the student
performance specified in a performance objective as the goal or objective
of education is misleading and inaccurate’ (1979, p. 208)9. Eisner makes
the point that expressive objectives build upon performance objectives, but
that most frequently the major goals of education are best described as
expressive objectives. An educational experience as described by an
expressive objective not only affects each student differently but can have a
multitude of effects on a given student. Therefore, it is premature and even
foolish for us to think that the effects of an experience that we can specify and
measure are the most important ones. We cannot even be sure that we
are cognizant of the most important effects (1967, p. 250-260). This observation
further deepens the doubt about the basic assumption of the testing.
However, it is an extreme case of criticism. Atkin also thinks along the same
lines and asserts that the goals of education should come from philosophy and
needs; just because an objective is capable of being measured does not
make it an important objective. Ebel again seconds Atkin’s view and
writes, “simply stating that something is an objective does not make it
desirable” (1970, p. 171-173). Similar objections are raised when incidental
and intentional learnings are discussed in reference to criterion referenced
testing. In spite of all the criticisms made by reputed authorities on testing,
there is a group of positivist psychometricians and research methodologists
who counter the weaknesses (inconsistencies) of criterion-referenced
testing by putting forth logical and psychological explanations. These are
Popham, Husek, Glaser, Hambleton, Berk, Roid, Haladyna and a few others.
The history of the development of criterion referenced testing tells that after the
publication of Glaser's first article on criterion referenced testing in 1963,
over 700 papers were published up to 1985 and about 57 definitions of
criterion referenced measurement were offered. The most acceptable
definition is supposed to be that of Popham, who has positively connoted that
‘criterion referenced tests are constructed to permit the interpretation of
examinee test performance in relation to a set of well-defined competencies’
(1978).

4.5 It is worth retaining that criterion referenced testing possesses
relevance in the areas of measurement and evaluation. The earlier discussed
manifest inconsistencies may be removed. Some efforts have been made
in regard to procedural inconsistencies. Berk (1985, p. 1118) rightly says
that some measures have been suggested to ensure the reliability of such
tests. He refers to Hambleton and Novick's method for estimating p0,
Huynh's method for estimating p0 and κ, and Livingston's K²(X, Tx).
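
For readers unfamiliar with these indices, the quantities p0 (the proportion of students classified the same way on two occasions) and κ (agreement corrected for chance) can be illustrated with a short Python sketch. It computes both directly from two sets of mastery classifications. It does not reproduce Huynh's single-administration estimation method or Livingston's coefficient, and the function name and data are hypothetical.

```python
def decision_consistency(first, second):
    """Return (p0, kappa) for two lists of mastery decisions (True = master)."""
    n = len(first)
    # p0: observed proportion of consistent decisions.
    p0 = sum(a == b for a, b in zip(first, second)) / n
    m1 = sum(first) / n   # proportion of masters on the first occasion
    m2 = sum(second) / n  # proportion of masters on the second occasion
    # pc: agreement expected by chance from the marginal proportions.
    pc = m1 * m2 + (1 - m1) * (1 - m2)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else float("nan")
    return p0, kappa


# Hypothetical decisions for ten students on two administrations of a CRT.
first = [True, True, True, False, False, True, False, True, True, False]
second = [True, True, False, False, False, True, True, True, True, False]
p0, kappa = decision_consistency(first, second)
```

In this made-up example eight of the ten students receive the same classification on both occasions, so p0 is 0.8, while κ is lower because part of that agreement would be expected by chance.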


5.0 Concluding Remarks 

Conclusively, it is now evident that some issues are still unresolved
and deserve philosophical, psychological, communicational and semantic
conceptualisation and rectification. Statistical inconsistencies have been
overcome or are in the process of being overcome by the authorities in this
field. The unresolved issues should not be considered ‘unresolvable’ ones
and, therefore, there is a need to re-assess and reprocess the efficacy
and applicability of criterion referenced testing, which in turn will help humanists
in general to give a viable, just order to society and educationists in
particular to restructure and re-plan education.

SECTION-II
DEVELOPING CRITERION 
REFERENCED TESTS 


Use of Norm-Referenced And 
Criterion Referenced Measurement 
for Classroom Testing 


B.N. Sujatha Reddy 


ABSTRACT 


In norm-referenced measurement an examinee's
performance is interpreted in terms of the relative position held in
some known group, whereas in criterion-referenced measurement
an examinee’s performance is interpreted in terms of his mastery of
the content in a specified domain of instructionally relevant tasks.
For these two types of measurement and interpretation to be
meaningful and useful, evaluation instruments have to be
specifically designed for the type of interpretation to be made.
Hence a criterion-referenced test differs from a norm-referenced test
with respect to many of the characteristics of a measuring
instrument, like planning the test, item analysis, reliability and validity,
though the differences in most cases are matters of degree
rather than kind. The merits and limitations of norm-referenced
measurement and criterion-referenced measurement indicate that
both have a place in an educational testing programme.


1. INTRODUCTION 


Assessment in education is a multifaceted process. It refers to "the
collection and evaluation of data involving inputs to, transactions within, and
outputs from an educational system" (Payne, 1974, p. 3). It includes the
appraisal of all the processes and products which describe the nature and
extent of a pupil's learning. The assessment data provide information to
teachers, administrators, parents, pupils and others about the educational
system and serve as a basis for a variety of individual and programmatic
decisions like placement decisions, classification decisions and programme
improvement decisions. The effectiveness of the instructional/educational
system depends to a large extent on the comprehensiveness and the quality of
the assessment data on which the decisions are made.

A comprehensive assessment of a pupil's progress towards all of the 
important outcomes of instruction depends on a variety of measurement 
tools and evaluation procedures. These may be classified and described 
in different ways, depending on the frame of reference used. One way of
classifying and describing them is in terms of how the results are interpreted.
There are two basic ways of interpreting pupil performance in tests and on
other evaluation instruments. They are Norm-Referenced and Criterion-Referenced.


2. Criterion-Referenced Measurement versus Norm-Referenced
Measurement


2.1 Norm-Referenced Measurement 


In a norm-referenced measurement framework "an examinee's
performance is evaluated relative to the performance of others in some
well-defined comparison or norm group" (Martuza, 1977, p. 6). Using the class as
a reference group, pupil A's performance can be described as follows:


Student A did better than 70% of his classmates in a test on
addition of numbers. Student A has understood the basic concepts in
a unit better than 90% of his classmates.


Student A ranks above 20% of the class.

In a norm-referenced interpretation one does not describe what 
percentage of the test items on addition of numbers pupil 'A' answered 
correctly, but simply what percent of the pupils in the class (norm group) he 
surpassed. 


2.2 Criterion-Referenced Measurement 


In a criterion-referenced measurement framework, the student's
performance is described in terms of his mastery of the content in a
well-defined content domain or his achievement with respect to an explicit
instructional objective by comparing his test score to some preset standard or
criterion for success (Martuza, 1977). Criterion-referenced interpretation
enables one “to describe what an individual can do, without reference to
the performance of others” (Gronlund, 1977, p. 18). In criterion-referenced
measurement a student’s performance is interpreted by comparing it to
some specified behavioural criterion of proficiency.

e.g. Student A got 70% of items correct in a test on addition of numbers.
Student A has understood 80% of the basic concepts in a unit.
Student A can type 40 words per minute without error.


To polarise the distinction between norm-referenced measurement
and criterion-referenced measurement, it can be said that the focus of a
norm-referenced measurement score is on how many of student A's peers do
not perform as well as he does, whereas the focus of a criterion-referenced
measurement score is on what it is that student A can do (Mehrens,
1973).
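
The two interpretations can be contrasted numerically. In the Python sketch below the class scores and function names are hypothetical. The same raw score of student A is interpreted once against a fixed content domain (percent of items correct) and once against two different norm groups (percent of classmates surpassed).

```python
def percent_correct(score, n_items):
    """Criterion-referenced view: share of the test items answered correctly."""
    return 100 * score / n_items


def percent_surpassed(score, class_scores):
    """Norm-referenced view: share of classmates scoring below this score."""
    below = sum(1 for s in class_scores if s < score)
    return 100 * below / len(class_scores)


# Student A scores 7 on a ten-item test on addition of numbers.
score_a, n_items = 7, 10

# Hypothetical scores of the other pupils in two different classes.
class_one = [3, 4, 5, 5, 6, 6, 7, 8, 9, 10]
class_two = [2, 3, 3, 4, 4, 5, 5, 6, 6, 8]

crt_view = percent_correct(score_a, n_items)      # same in any class
nrt_view_one = percent_surpassed(score_a, class_one)
nrt_view_two = percent_surpassed(score_a, class_two)
```

Student A has 70% of the items correct whichever class he sits in, but he surpasses 60% of the first class and 90% of the second. The norm-referenced figure changes with the reference group, while the criterion-referenced figure does not.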

2.3 Uses of Norm-Referenced Measurement and Criterion-Referenced
Measurement for a Classroom Teacher

Norm-referenced measurement and criterion-referenced measurement
are useful for a classroom teacher in her instructional endeavour.
They serve the four basic uses of evaluation* (Gronlund, 1977) in classroom
instruction. However, criterion-referenced instruments are appropriate
for formative and diagnostic evaluation and norm-referenced instruments
for summative evaluation. Placement evaluation is likely to require
both criterion-referenced tests (to describe possession of prerequisite skills)
and norm-referenced tests (to determine level of performance for advanced
placement).


* 1) Evaluation of pupil entry behaviour in a sequence of instruction
     (Placement evaluation)

  2) Evaluation of pupil progress during instruction (Formative evaluation)

  3) Evaluation of pupil learning difficulties during instruction
     (Diagnostic evaluation)

  4) Evaluation of pupil achievement at the end of instruction
     (Summative evaluation).

The two distinct types of (criterion and norm-referenced) meas- 
urement and interpretation are most meaningful and useful when the evalu- 
ation instruments are specifically designed for the type of interpretation to be 
made. Hence the need for the development and use of norm-referenced 
tests and criterion-referenced tests. 

The following pages point out the limitations of norm-referenced measurement
which led to the development of criterion-referenced measurement, compare
the norm-referenced test and the criterion-referenced test with respect to
different aspects of the test, and discuss the limitations of
criterion-referenced measurement.


3. Development of criterion-referenced measurement 


3.1 Background : 

The distinction between norm-referenced measurement and criterion-referenced
measurement is not of recent origin. As early as 1918 Thorndike
observed that "There are two somewhat distinct groups of educational
measurements: one asks primarily how well a pupil performs a certain uniform
task; the other asks primarily how hard a task a pupil can perform with
substantial perfection, or with some other specified degree of success. The
former are allied to the so-called method of average error (norm-referenced);
the latter to the method of 'right and wrong cases' (criterion-referenced).
Each of these groups of methods has its advantages, and each deserves
extension and refinement, though the latter seems to represent the type which
will prevail if education follows the course of development of the physical
sciences". However, educational measurement did not develop along the lines
of the physical sciences, but adopted a psychological model based on the
concept of individual differences which continues to dominate educational
testing even today (Martuza, 1977, p. 391).


3.2 Limitations of Norm-referenced measurement 


Since the early seventies educators have shown a growing interest in
criterion-referenced measurement and have turned their attention to
developing procedures for building criterion-referenced tests and expanding
their application to a wide variety of curriculum areas. The
criterion-referenced measurement movement is in part a reaction to the misuse
of psychometric methods which were developed for assessing individuals'
aptitudes and abilities and which differentiate individuals at all points
along the continuum. The same classical measurement theory has been carried
over to achievement testing. But there are times when it is not intended to
differentiate individuals by their achievement, but to find out whether they
have achieved a specific set of objectives. In such instances a
norm-referenced test is not appropriate.

Stressing this point and also pointing out some other trends, Martuza
(1977) in his book 'Applying Norm-Referenced and Criterion-Referenced
Measurement in Education' lists the following trends which gave rise to
criterion-referenced measurement.

1) growing criticism of standardised norm-referenced tests of achieve- 
ment and ability, which do not help pupils to check their progress 
toward the attainment of a certain skill or knowledge. 

2) the growing controversy about grades, which is closely related to the
first. Critics argue that the fight for good grades engenders a competitive
ethic, emphasizing "winning" the good grade race at the expense of
the true purpose of education. Further, a grade of A or B does not tell
anything about what a learner can do, but only whether he is superior
or inferior to some vaguely defined reference group.

3) growth of the instructional technology movement and the failure of
norm-referenced tests to meet the instructional technology needs in
evaluating either individual performance or the efficacy of alternative
instructional strategies.

4) the assumption that most children can attain a given performance
standard, which underlies such approaches to instruction as
'Individually Prescribed Instruction' and 'Mastery Learning', which need the
use of criterion-referenced measures, both in the formative and
summative stages.

Pointing out the limitations of norm-referenced measurement
elsewhere in his book, Martuza writes: "educators have become increasingly
aware of the limitations of norm-referenced procedures for (a) diagnosing
student achievement deficiencies, (b) assessing the level of a student's
knowledge within a well defined content area, (c) evaluating the effects of a
curriculum change on student achievement, (d) assessing the strengths and
weaknesses of school programme within a particular school district"
(Martuza, 1977, p. 6).

Thus the recent support for criterion-referenced measurement has
mainly originated from the emphasis on behavioural objectives, the sequencing
and individualization of instruction, the development of programmed
materials, the growing emphasis on mastery learning, the failure of
norm-referenced measurement procedures to help teachers meaningfully in
their classroom instruction, and the possible danger of norm-referenced
testing promoting unhealthy competition and injuring the self-concept of
low scoring students.


4. Norm-Referenced Tests and Criterion-Referenced Tests. 


4.1 Meaning

4.1.1 Norm-Referenced Test :- Norm-referenced tests are "designed to
rank students in order of achievement, from high to low, so that decisions
based on relative achievement (e.g. selection, grouping, grading) can be made
with greater confidence" (Gronlund, 1976).

A norm-referenced test is one designed "to measure the growth in a
student's attainment and to compare his level of attainment with the levels
reached by other students and norm group" (Bormuth, 1970).

4.1.2 Criterion-Referenced Test :- There is no single agreed upon
definition of a criterion-referenced test. Nitko (1983) observes that
criterion-referencing is in a state of development and standard ways of
criterion-referencing have not become firmly established. Hambleton et al.
(1978) point out that there are almost as many ideas about what a
criterion-referenced test is as there are contributors to the field. Given
below are some of the definitions of a criterion-referenced test.

"Criterion-referenced test is one that is deliberately constructed to yield 
measurements that are directly interpretable in terms of specified performance
standard" (Nitko, 1983, p. 446). Ivens (1970, p. 2) defined a criterion-referenced
test as "one consisting of items keyed to a set of behavioural objectives". A
criterion-referenced test according to Popham is one which "ascertains an
individual's status with respect to a well-defined behaviour domain".

A criterion-referenced test was defined by Harris and Stewart (1971) as
one consisting of a sample of production tasks drawn from a well defined
population of performances.

Although these definitions differ widely in verbal terms, all of them
emphasise: (a) test organisation based on specific tasks or behavioural
objectives, (b) assessment in terms of predefined performance criteria
(Pritam Singh, 1983, p. 3).

Of all the definitions/explanations of a criterion-referenced test, Popham's
definition has been preferred by many. The author of the present paper also
favours the definition by Popham.

4.2 Concepts that are not equivalent to Criterion-referencing (Nitko 1983).


Some concepts which are often confused with criterion-referencing are 
discussed here in order to clarify the kinds of interpretations that criterion- 
referencing implies. 


4.2.1 One of the major sources of confusion is the word 'criterion'
in criterion-referenced test. For many it refers to a 'passing score' or
'cut-off score' or a 'minimum proficiency level'. But the most influential
criterion-referencing papers of Glaser (1963) and Popham and Husek (1969)
make it clear that the word 'criterion' in a criterion-referenced test refers
to a domain of behaviours, and in criterion-referencing one is interested in
referencing an examinee's test performance to a well defined domain of
behaviour measuring an objective or skill.


Popham (1981, p. 27) has clarified this confusion by distinguishing
between two conceptions of criterion: criterion as a level and criterion as a
desired behaviour. When the term 'criterion' is used to signify a desired
level of proficiency, it reflects criterion as a level. When 'criterion' is
used to signify the target behaviours themselves, it reflects criterion as
desired behaviour. Further, Popham (1981, p. 28) points out that interpreting
criterion as a level of examinee's proficiency yields almost no dividends over
traditional testing practices. In fact, by using that conception of criterion,
one could magically transform any norm-referenced test into a
criterion-referenced test merely by setting a specific proficiency level for
the test. When the criterion is taken as the desired behaviour, by contrast,
criterion-referenced tests provide a description of an examinee's status with
respect to a clearly delimited domain of behaviour.


4.2.2. The Criterion measure confusion : 


Another usage of the word 'criterion' that is sometimes confused with
criterion-referencing is criterion-related validity. Gronlund (1976)
distinguishes the two as follows: the word 'criterion' in criterion-referenced
testing refers to the type of behaviour (as described in the instructional
objectives) that the test score represents; 'criterion' in criterion-related
validity refers to some second measure of performance that the test scores
are to predict or estimate.


4.3 Concepts closely related to criterion-referenced test. 

Currently there is confusion about the differences among three closely
related kinds of tests: the criterion-referenced test, the domain-referenced
test and the objective-referenced test.

4.3.1 A domain-referenced test is one that is built so that scores on it can
be referenced to a well defined domain of behaviours in a way that permits an
examinee's status on that domain to be estimated (Nitko 1983, p. 457). Nitko
observes that there is little difference between this broad notion of
domain-referencing and criterion-referencing. As Hambleton et al. (1978, p. 3)
point out, if Popham's definition of criterion-referenced test is adopted,
there is no essential difference between a criterion-referenced test and a
domain-referenced test.

4.3.2 Objective-Referenced Test: Objective-referenced tests have
behavioural objectives associated with specific test items (Nitko 1983,
p. 457). Hambleton et al. (1978, p. 3) define objective-referenced tests as
follows: "In a CRT the items are a representative set of items from a clearly
defined domain of behaviour measuring an objective, whereas within an ORT no
domain of behaviour is specified and items are not considered to be
representative of any behaviour domain". Therefore, objective-referenced
tests may or may not be criterion-referenced. (For further details interested
persons can go through Nitko (1983), Hambleton et al. (1978), Martuza (1977),
Pritam Singh (1983), Sally Brown (1981).)

4.4 The above explanations of norm-referenced test and criterion-referenced
test make it clear that discriminating individual differences is the
major emphasis of norm-referenced test; but they are of no concern in
criterion-referenced test. On the other hand, criterion-referenced test
emphasizes what competencies students do and do not possess or their status
with respect to specified performance standards. In norm-referenced test the
concern is: How
the following situations. Mark 'A' for asexual and 'S' for sexual reproduction. 

(i) Onion bulb developing into a plant 

(ii) Mango seed developing into a plant 

(iii) Cutting of rose developing into a plant

(iv) Piece of ginger with a bud developing into a plant.

But item 2 requires much greater understanding than item 1. Hence a
score of 85% on the second type of items (difficult) may represent superior
mastery. Obviously such a criterion for mastery as 85% of the items on the
criterion-referenced test is arbitrary, since item difficulty is arbitrary.
Therefore, it is difficult to base criterion-referenced measurement on
meaningful criteria of achievement.

Chase (1974) lists the following limitations of criterion-referenced tests:

1. A criterion-referenced test tells only whether a child has reached
proficiency in a task area, but does not show how good or poor the
child's level of ability is.

2. Criterion-referenced tests have general validity problems because
the criteria and the tasks presented on criterion-referenced tests may
be highly influenced by a given teacher's interests or biases.

3. Another validity question is how well criterion-referenced tests can
represent the multiplicity of objectives of different subject areas. Only
some areas readily lend themselves to listing specific behavioural
objectives around which criterion-referenced tests can be built. This may
be a constricting condition for teachers.

4. Criterion-referenced tests are necessary for only a small fraction of
important educational achievements. But promotion and assessment
of diversity of skills is an important function of the school and it requires
norm-referenced testing.


In addition to the above criticisms Hopkins and Stanley (1981) have
raised the following questions with reference to criterion-referenced tests.
What is mastery? How does one logically establish an absolute standard or
criterion of mastery? How does one justify a criterion of, say, 80% as the
cut-off score for mastery of some concept or domain? When does one have
proper understanding? How does one establish a minimum level of competence in
reading, spelling, listening and the like? (These questions have to be
answered by further research.)

Thus, because of the non-availability of definitive means of establishing
performance standards and other limitations, criterion-referenced tests
should, at least initially, be used with certain reservations.


6. Conclusion 


The foregoing discussion on the pros and cons of criterion-referenced
and norm-referenced measurement and their procedures indicates that there is
a place for both in educational testing. As Mehrens (1973) suggests, the way
most schools are currently organized, with time of instruction constant for
all individuals and degree of learning the variable, discrimination testing
should be prevalent. However, as more individualized instructional processes
are used and as more is learned about how various subject matters should be
sequenced, mastery testing may increase in importance.

In the Indian context, there seems to be more emphasis on norm-referenced
measurement in the evaluation of students. The idea of criterion-referenced
measurement is yet to catch on in Indian schools. As there is more
emphasis on the terminal examination, which is based on norm-referenced
measurement, our teachers/evaluators do not favour criterion-referenced
measurement. However, we should make a beginning, especially at the
elementary stage where mastery of basic skills is of utmost importance and
criterion-referenced measurement is definitely useful to the teacher in
testing the basic skills. Hence there is a need for reorienting our teachers
and teacher educators towards criterion-referenced measurement and its
importance in educational evaluation.
Table 1 - Comparison of Norm-referenced tests
with Criterion-referenced tests


Test characteristic: Purpose

   Norm-referenced tests: To measure students' achievement and compare it
   with the levels reached by other students; helps to describe pupil
   performance in terms of the relative position held in some known group.

   Criterion-referenced tests: To measure students' achievement and to
   compare each individual's performance to a specified performance
   standard; helps to describe pupil performance in terms of a specified
   domain of instructionally relevant tasks.


Test characteristic: Planning the test

   Norm-referenced tests: The test is typically planned on the basis of
   general descriptions of subject-matter topics and process skills (of
   late emphasis is given to specific behavioural objectives in planning
   the test).

   Criterion-referenced tests: The test is typically planned in terms of
   specific, behaviourally stated objectives, each providing the basis for
   one or several related test items.


Test characteristic: Preparing the test items

   Norm-referenced tests: Test items are constructed to produce maximal
   variability in performance across students.

   Criterion-referenced tests: Test items are constructed to represent the
   domain of the objective and to measure proficiency for a specified task.


Test characteristic: Type of test items

   Norm-referenced tests: All types are used.

   Criterion-referenced tests: All types are used.


Test characteristic: Level of difficulty of items

   Norm-referenced tests: Test items are of medium difficulty, i.e., about
   thirty to seventy per cent of the pupils answer correctly (difficulty
   index from about 0.3 to 0.7); item difficulty is localised around 50%.

   Criterion-referenced tests: Difficulty of the items depends on the
   nature of the specific learning tasks to be measured. Hence items vary
   widely in difficulty, but are regularly of the mastery type, i.e. most
   pupils respond correctly when the instruction has been effective.


Test characteristic: Items selection

   Norm-referenced tests: Item difficulty and the discriminating power of
   the item are used as criteria. Deliberate attempts are made to eliminate
   very easy and very difficult items to obtain a wide spread of scores.

   Criterion-referenced tests: Items are selected on the basis of how well
   they reflect the specific learning tasks. No attempt is made to modify
   item difficulty, or to eliminate easy items from the test, in order to
   obtain a range of scores.


Test characteristic: Criterion for mastery

   Norm-referenced tests: No criterion for mastery is customarily
   specified.

   Criterion-referenced tests: Objectives for criterion-referenced test
   items designate a criterion for mastery or it sometimes can be inferred.


Test characteristic: Method of reporting results

   Norm-referenced tests: Percentiles or standard scores are employed.

   Criterion-referenced tests: Apt to employ "percent correct" scores.


Test characteristic: Interpreting the result

   Norm-referenced tests: A total score or several sub-scores are computed
   and a pupil's relative standing in a group or class is ascertained. In
   case of standardized tests, norms supplied by the publisher are used to
   interpret the tests.

   Criterion-referenced tests: A pupil's success or failure on a test item
   or small group of similar test items is determined and a statement is
   prepared that describes his performance solely with reference to certain
   performance objectives. Unlike norm-referenced tests which are provided
   with norms, criterion-referenced tests require one to set one's own
   cut-off for "adequate proficiency" to interpret the results.


Test characteristic: Validity

   Norm-referenced tests: Content, construct and criterion-related validity
   are appropriate.

   Criterion-referenced tests: Descriptive, Functional and Domain-Selection
   validity are appropriate (see Popham 1975).


Test characteristic: Reliability

   Norm-referenced tests: Classical estimates of reliability (test-retest,
   equivalent form, split-half and Kuder-Richardson methods, whichever
   method is appropriate and suitable) are typically used.

   Criterion-referenced tests: Classical estimates of reliability are not
   completely appropriate. But the concept of a precise measure, one that
   has a small standard error of measurement, is still important. There
   have been various attempts to develop statistical measures for
   estimating the reliability of criterion-referenced tests, but a
   satisfactory solution has not yet been achieved.


Test characteristic: Common uses

   Norm-referenced tests:
   1) To determine the relative standing of a pupil within a normative
      group.
   2) To assist in selection of individuals for educational programmes,
      employment etc.
   3) To provide a more or less global representation of pupil achievement
      in a specified learning area.

   Criterion-referenced tests:
   1) To ascertain a pupil's status with respect to established standards
      of performance, thereby determining what he knows and can do.
   2) To evaluate pupil entry behaviour in a sequence of instruction, and
      to diagnose learning difficulties of pupils.
   3) To evaluate the success of an instructional programme.


Test characteristic: Availability of standardized tests

   Norm-referenced tests: Tests are available covering most of the
   curriculum areas.

   Criterion-referenced tests: Attempts are being made to develop
   criterion-referenced tests and an increasing number are available, many
   of which focus on the basic skills.
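
The item-selection rule stated in Table 1 for norm-referenced tests can be sketched briefly. The difficulty index of an item is taken here as the proportion of pupils answering it correctly; the response data and names are hypothetical, and the 0.3 to 0.7 band is the one given in the table.

```python
def difficulty_index(item_responses):
    """Proportion of pupils who answered the item correctly (1 = correct)."""
    return sum(item_responses) / len(item_responses)


# Hypothetical responses of ten pupils to four items.
responses = {
    "item 1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "item 2": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "item 3": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "item 4": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
}

indices = {name: difficulty_index(r) for name, r in responses.items()}

# Norm-referenced selection: keep only items of medium difficulty.
kept_for_nrt = [name for name, p in indices.items() if 0.3 <= p <= 0.7]

# Criterion-referenced selection ignores difficulty; every item that reflects
# the learning task is retained, including the very easy "item 1".
kept_for_crt = list(responses)

print(indices)
print("Norm-referenced test keeps:", kept_for_nrt)
print("Criterion-referenced test keeps:", kept_for_crt)
```

In a criterion-referenced test an easy item such as "item 1" is not a defect; it may simply show that instruction on that task has been effective.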
Implications of Criterion Referenced
Tests For Classroom Teaching And
Testing


V.N. Srivastava 
ABSTRACT 


This paper deals with the relevance of Criterion Referenced
Testing for the improvement of teaching and learning. After listing
the various steps involved in the Criterion Referenced teaching-testing
model, the relationship between programmed learning and Criterion
Referenced testing is identified and its use for providing remedial
measures is highlighted, so as to achieve mastery learning by the
students. Identification of domain elements as the basis of criterion
referenced tests in educational guidance is emphasised. The focus of
C.R.T. on students' learning and improvement of instruction is
stressed. Self-learning and self-evaluation become quite inevitable
while using the Criterion Referenced approach for teaching and testing.
Curriculum improvement through analysis of results of C.R.T. is also
reflected. The role of C.R.T. in improving the whole teaching-learning
process, with a focus on development of the individual in accordance
with the intended outcomes of learning, is the focus.


1. INTRODUCTION :- 


It is a common observation that there is a high percentage of failures in
public examinations at the secondary stage. The main reason is that promotions
from lower classes to higher classes are given without teaching the children
to master the skills which are necessary to further their knowledge in the
higher classes. In the monograph on the criterion-referenced model of teaching
Dr. Singh (1975) lists the following major attributes. In the Criterion
Referenced Model the students are :-

(a) taught a learning unit for mastery of objectives formulated for that unit
and tested for mastery of those objectives

(b) allowed to proceed to new material if mastery is obtained

(c) given remedial instruction if the material presented is not mastered

(d) tested again after remedial work to check for mastery of material

(e) allowed to proceed to the next unit if mastery is obtained after remedial
instruction

(f) given further remedial instruction if mastery is not attained even after
remediation

(g) tested again for mastery of objectives.
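
The cycle described in steps (a) to (g) can be sketched as a simple loop. This is only an illustration under stated assumptions: the function names, the mastery cut-off of 0.8 and the limit of three rounds of testing are hypothetical and are not prescribed by the model.

```python
def run_unit(take_test, remediate, mastery_cutoff=0.8, max_rounds=3):
    """Teach-test-remediate cycle for one learning unit.

    take_test(round_no) returns the proportion of objectives mastered (0.0-1.0).
    remediate(round_no) gives remedial instruction and returns nothing.
    Returns (mastered, rounds_of_testing_used).
    """
    for round_no in range(1, max_rounds + 1):
        score = take_test(round_no)           # steps (a), (d), (g): test for mastery
        if score >= mastery_cutoff:
            return True, round_no             # steps (b), (e): proceed to next unit
        if round_no < max_rounds:
            remediate(round_no)               # steps (c), (f): remedial instruction
    return False, max_rounds


# Hypothetical pupil whose scores improve after each round of remediation.
scores = iter([0.5, 0.7, 0.9])
log = []

mastered, rounds = run_unit(
    take_test=lambda r: next(scores),
    remediate=lambda r: log.append(f"remedial instruction after test {r}"),
)

print(mastered, rounds)
print(log)
```

The pupil moves on only when the cut-off is reached; how many rounds of remediation a school can afford is a practical decision outside this sketch.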


In case our schools adopt this method of teaching and testing, much of
the backwardness in school subjects will be removed right in the class where
the child ought to have mastered the material. The large percentage of
failures at the public examinations year after year, sometimes more than 50%,
does indicate that our system of learning, teaching and evaluation needs a
change, because 75 to 80 per cent marks are expected to be obtained by at
least 75 to 80% of the student population if the mastery-learning approach is
followed.


2. PROGRAMMED LEARNING AND CRITERION REFERENCED TESTING :-


All the new approaches to learning aim at helping the child to master
the skills, have the understanding and be in a position to apply the knowledge
and the skills learned in very many situations. One must not proceed to learn
multiplication till he has mastered addition and subtraction, and further one
must not proceed to learn division till he has mastered multiplication. This
clearly shows that there is a learning hierarchy and that effective learning
is not possible without permitting the child to master those skills which
enable him to mature enough to develop further. Programmed learning can be
more effective if the criterion-referenced approach to testing is adopted.


3. DIAGNOSTIC EVALUATION :-

When as teachers we say that a student X is weak in English it does not
convey much meaning. As teachers we also want to know which are the
areas where the child fails: spelling of words, their meaning, their use in
sentences, verbal comprehension, expression in written language, grammar etc.
The C.R.T. model can help us in diagnosing the specific areas in which the
child is to be given remedial instruction.
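
A minimal sketch of such an area-wise diagnosis is given below. The area names follow the paragraph above, but the item responses, the function name and the cut-off of 0.8 are hypothetical and serve only to illustrate the idea.

```python
def diagnose(responses, cutoff=0.8):
    """Return the areas whose proportion of correct items is below the cutoff.

    `responses` maps each skill area to a list of item scores (1 = correct).
    """
    weak = {}
    for area, items in responses.items():
        proportion = sum(items) / len(items)
        if proportion < cutoff:
            weak[area] = proportion
    return weak


# Hypothetical item scores of student X on an English test, grouped by area.
student_x = {
    "spelling": [1, 1, 1, 1, 1],
    "word meaning": [1, 1, 1, 1, 0],
    "grammar": [1, 0, 0, 1, 0],
    "written expression": [0, 1, 0, 0, 1],
}

needs_remediation = diagnose(student_x)
print(needs_remediation)
```

Instead of the single statement "weak in English", the teacher obtains the particular areas in which remedial instruction is required.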


4. FIELD THEORETICAL APPROACH IN LEARNING AND EFFICIENCY
OF DOMAIN ELEMENTS :-

Every learning situation is a field where the learner and the teacher are
both influencing each other. To make the field effective so that the child
masters the knowledge which the teacher wants to impart, it becomes necessary
that grading and sequencing of concepts be done properly. Unless this
is done in a systematic manner, proper teaching vis-a-vis testing through
criterion referenced tests is not feasible. We know that learning takes place
from known to unknown, from concrete to abstract etc., and the law of readiness
starts operating only when the learner is in a position to comprehend the
learning materials presented to him. Effective learning is possible when the
learner gets motivated because of his inputs of knowledge. 'In fact use of
criterion referenced tests helps to establish this sequence when the results
are analysed. Knowledge of hierarchy of concepts and their attainability at
various grade levels is of advantage in using such tests.'


5. EDUCATIONAL GUIDANCE AND USE OF C.R.T. :-


At the Bureau of Psychology, Allahabad, parents often come with the
complaint that their wards are not doing well in some subjects. The poor marks
obtained by the wards fail to indicate the real area of their weaknesses in
that subject. If the curricula of all classes from class II to class XII are
analysed in all the subjects concept-wise, domain-wise, objective-wise and of
course individual-wise, and a proper hierarchy is laid down, then both learning
and teaching can be more effective and criterion-referenced testing can then
prove to be of great help in locating the areas of backwardness. Concept-wise
analysis would indicate the level at which a particular concept has been
learnt or can be learnt. Domain-wise analysis can reflect on the
appropriateness of the domain description and definition. Objective-wise
analysis would help to find out the level of attainment of various objectives.
In the case of student-wise analysis the identification of non-masters can
form the basis of selection of appropriate correctives to be applied. Thus we
see that successful educational guidance in removing the difficulties of the
learner is made possible through the effective use of C.R. testing.
Validational norms for all such classes can be worked out once the tests have
been made reliable. So far at the Bureau of Psychology we have only been
predicting the chances of success at the H.S. Examination, having tested the
boys and girls on group tests of intelligence and having known their
personality make-up by using personality inventories. But if our focus is
turned to democratic testing it will be our duty to enable every child to
learn the subjects and master them.


6. LEARNING-TEACHING IN REFERENCE TO GROWTH ORIENTED CONCEPTS :-


We know that learning helps the child to mature physically, socially,
emotionally and academically. We learn in order to grow. Dr. Mehrotra once
observed that no student could escape learning under him. He referred to
Skinner's operant conditioning. If the learner is once put on the right
learning track and society exercises the policing that the learner does not
leave the track, learning is bound to happen. Hull's work () on learning,
framing equations of input and output, does suggest that we can do much to
improve teaching resulting in effective learning. Criterion referenced testing
attempts to discover not only the inadequacies in students' learning but also
uncovers the areas of difficulty and the causes of those inadequacies. It is
often found that text books meant for a particular class do not give
sufficient examples to clear a concept. Those who lay down the curriculum fail
to examine minutely whether the books provide the experience that is helpful
for the learner. C.R.T. attempts to validate the concepts which are delineated
in advance on a logical basis. Likewise the validity of specified domain
objectives is also checked against the empirical evidence that we obtain after
administering C.R.T. tests and scoring them. Self-paced learning is allowed
for slow learners. Remedial measures are taken on sound lines. In fact the
focus of these tests is on the improvement of students' learning to the
maximum level, and they de-emphasise the traditional comparisons that label
some students as under-achievers, slow learners or deviants. The faith that
the learner can achieve the goal is as much true for the teacher as for the
learner.


7. SELF LEARNING AND SELF EVALUATION :-

The importance of cultivating proper study habits in students at younger ages
cannot be ignored. I am reminded of Eklavya of the Mahabharata, who was denied
the help of teaching from Guru Dronacharya. But his aim to become a master
of archery by self-learning and self-evaluation made him a more perfect learner
than Arjun, whose learning was constantly guided, supervised and evaluated.
The ultimate aim of C.R.T. is to promote self-evaluation. Stress on measurement
of specified learning outcomes in terms of what has been learnt and what is
still to be learnt makes the learner cognisant of his strengths and weaknesses.
In behaviour therapy the changes and modifications made in the behaviour,
plotted on graph paper at specified intervals, motivate the patient
to do his best to attain the specified goal in a given time; the same holds
for the learner. With continuous feedback on the adequacy of his learning, the
student develops a sense of success and recognition which motivates him for
further learning and ultimately leads him to develop a positive self-concept.
If we can develop scales to measure the self-concept of these learners, we can
get very valuable data on how the interactions help him to grow not only in
learning but in his personality adjustment also. Thus these types of tests do
motivate the learner for self-evaluation, thereby motivating him for
self-improvement.


8. CRITERION REFERENCED TESTING AND CURRICULUM :-


We often see that certain courses are prescribed at lower stages in the
schools of India when those very concepts were left out in western schools
because they did no good to learners. None can deny that the use of set theory
in lower classes created more confusion than good for learners in our schools.
Had C.R.T. been used in those days, the framers of the new curricula
would have been forced to rethink about putting a set-theory course in lower
classes. Unless the prescribed curricula are analysed in terms of sequential
learning, in the form of units arranged in order of their complexity,
development or sequential growth, it would be difficult for an ordinary teacher
to develop such tests and take advantage of their evidence. The failure of the
majority of boys and girls of rural areas to grasp the new topics added to the
science course in secondary schools of U.P. is a clear proof. For each unit
identified, major concepts have to be identified and arranged in order of their
complexity. These concepts ought to be further spelt out into sub-concepts
to delineate the scope of each domain of testing. Micro-teaching can
then take care to explain these sub-concepts to those students who failed to
grasp them in the first instance. Learning and teaching can be improved if
these are criterion based, and then alone can criterion-referenced testing
help us to find out the shortcomings in the learner. Feedback is possible when
we have specific evidence as revealed by C.R.T. scores.


REFERENCES 


1. Bloom, B.S. Taxonomy of Educational Objectives.

2. Popham, W.J. (1975). Educational Evaluation. Prentice-Hall Inc., Englewood Cliffs,
New Jersey.

3. Singh, Pritam (1953). Criterion Referenced Testing. NCERT, New Delhi.

4. Gronlund, N.E. (1976). Measurement and Evaluation in Teaching. Macmillan Publishing
Co. Inc., New York, London.


Constructing Criterion 
Referenced Tests 


Anand Bhushan 
ABSTRACT 


This paper is devoted mainly to the construction of criterion
referenced tests for the outcomes of the cognitive area. As the
concept is not very popular and a well consolidated definition is
yet to emerge, the construction is preceded by the presentation of
a working definition of criterion referenced tests.

The construction proper has been discussed in four sections.
The first section deals with domain definition: its components, criteria
and the procedure of defining a domain. The second section outlines the
procedure of generating items. The third section of the chapter
elaborates the improvement of items in two steps. At the first step, the
stimulus homogeneity is determined and desirable measures for improving
it are discussed. The fourth section is devoted to validation of tests. It
presents a set of workable suggestions for determining the reliability
and validity of criterion referenced tests.


1. DEFINITION 

One way of defining the concept of Criterion Referenced (CR) tests
may be to differentiate it from the concept of Norm Referenced (NR) tests.
The two types of tests can mainly be differentiated with respect to the mode
of interpretation and the use of test results. In NR tests, an individual's
performance is seen in relation to that of a norm group, whereas in CR tests it
is seen in relation to some fixed criterion. The term criterion assumes a special
significance as it is the reference point in the concept of CR tests.


The usage of the term criterion has added to the conceptual
confusion, because the term was in popular usage in the traditional
psychometric field, where it carries different senses in different contexts. In
the traditional measurement milieu, the term is used to denote the cut-off point
in a distribution of scores. Such criterion levels are used to separate
groups of examinees. In the context of criterion related validity, the term refers
to a variable of future behaviour which is predicted by the specific
test scores (Brown 1970). Some writers deal with the criterion in terms of
post-school outcomes such as preparing for a future profession. Such remote
behaviours, they argue, should constitute the criteria to which the test
should be referenced. This position in the study of criteria appears attractive,
but as such it is of little help to one who is engaged in the business of
developing evaluation tools.


In the context of criterion referenced testing, unlike its usage in the
traditional measurement milieu, the term 'criterion' refers to an assembly of
behaviours called a behaviour domain. It is in this context that "A
criterion-referenced test is used to ascertain an individual's status with
respect to a well defined behaviour domain" (Popham 1981). Here, a well
defined behaviour domain is the key phrase. It refers precisely to a class
of behaviours. For instance, a behaviour such as "to multiply 10 and 29
correctly" demonstrates a very limited behaviour and causes a practical
problem. During any course of moderate length, thousands of such
behaviours would be developed, and their number would
become unmanageably large. So, a behaviour domain would encompass a
fairly large domain of homogeneous behaviours such as "to multiply two
digit integers correctly".


In examples of fairly large domains of behaviour, such as (a) to
discriminate between growth and change, (b) to relate food and energy in
living organisms and to explain the process of photosynthesis, etc., the
behaviours have been provided context by the corresponding specification
of contents. In the examples cited above, "growth and change in living
organisms" is the content which provides context to the behaviour "to
discriminate". An abstract behaviour such as "to relate" has been provided
context by specifying content, i.e. "food and energy in living organisms",
and the behaviour "to explain" has been specified by the content "process of
photosynthesis". With the help of these examples, it can be understood that
any behaviour domain, even if it does not make an explicit mention of
content, would be precisely defined only in the context of some content. As
behaviours do not occur in the abstract, content specification is a pre-requisite
for a well defined behaviour domain. Concluding the discussion, the
essence of criterion-referenced tests may be precisely presented (Popham
1981) as follows:

(i) a well explicated domain of behaviour be delineated, and

(ii) an individual's performance in relation to this behaviour domain be
ascertained.

All through the present chapter, criterion referenced (CR) tests refer to
an individual's performance with respect to a criterion, that is, a well
defined behaviour domain.


2. CONSTRUCTING A CRITERION REFERENCED TEST 


Some early writers believe that the technique of construction of
criterion referenced tests is similar to that of norm referenced tests and that
the two testing modes simply differ in their purpose and use. It is almost agreed
that the purpose of norm referenced tests is selection and classification,
whereas the purpose of criterion referenced tests is to identify the
individual's learning deficiencies in a given course.


In the construction of any test for selection, the main purpose is to 
develop items which are capable of discriminating among learners. Here
only those items are selected, which ensure some magnitude of variance in 
obtained sets of scores. The technology of Norm referenced tests 
construction is based on the assumption that a moderate amount of score 
variance exists in the test scores. 


The construction of criterion-referenced tests has an altogether
different focus. An adequate criterion referenced test is considered to be one
which identifies the learning deficiencies of those who took a course at a
moderately effective level. In an ideal situation a CR test may not
necessarily yield score variance at all. Variance may be present in
some measure, but it is not a necessary attribute of a criterion referenced
test. What is desired is (a) a precisely defined behaviour domain and (b)
test items of a quality that pinpoints the status of an individual with
respect to that well defined behaviour domain. Like any measuring tool,
criterion referenced tests are constructed through a well sequenced,
interconnected set of steps. The steps required for constructing a criterion
referenced test are:


2.1 Defining Domain
2.2 Generating items
2.3 Improving items
2.4 Validating the test items


Each one of these steps has been briefly described in the following
paragraphs:


2.1 Defining Domain 

The first step in constructing criterion-referenced tests is to
specify the behaviour domain that the test items will measure and to which every
individual's performance will be referenced. Domain definition is a
complicated task and requires a number of considerations.


2.11 The first and foremost consideration is how large a chunk of an
individual's behaviour the domain should set out to circumscribe, that is,
whether it should be more general or highly specific. There are equally
compelling arguments favouring both options. More general goals usually cover a
fair amount of the behaviour domain, so they are limited in number and easy to
monitor. On the other hand, general goals are often stated so nebulously that
they help little in test construction. More precise goals are useful for
developing tests, but they become unmanageably large in number and it becomes
practically very difficult to keep track of all of them. Ground rules are needed
to decide how specific a domain description should be. To help in
deciding the degree of generality of a domain definition, three points
for consideration have been suggested.


2.11 (i) Instructional duration: The test constructor may decide to
cover domains that could be taught in a single period or instead focus on
domains that could be taught over months. Instructional duration
itself varies for a skill, an ability or an attitude, and it also varies from
person to person, but some studies suggest that domain developers should
describe behaviours taking at least a week to promote.


2.11 (ii) Limited priorities: One way of determining magnitude
may be to fix a limit on the number of domains that may be incorporated
into a test of moderate size. The behaviours that are to be developed during
the course are identified and a set of priorities among them is determined.
In the domain description, all the priorities should be captured, given adequate
weightage and incorporated in the domains, which are decided well in advance.


2.11 (iii) Item-homogeneity: Another way to determine the magnitude of
generality is to select as large a chunk of behaviour as would probably
yield homogeneous items, i.e. items which would appear to perform the
same function with respect to their content and format.


2.12 Content and Intended Learning Outcomes: The second
consideration that is important for domain definition concerns content
and behaviour specification. All content is not identical; different types of
content have different implications for instruction and testing. Among various
attempts at content classification, the most research based hierarchy has
been developed by Gagne. He suggested five categories: intellectual skills,
cognitive strategies, verbal information, motor skills and
attitude. The first two of the list have been divided into seven subcategories,
of which the two most popularly used learning types are concepts and rules
(principles). Bloom's taxonomy of cognitive behaviours may be helpful for
further specifying learning objectives at different levels. The two schemes of
classification of content and intended learning outcomes have of course
gathered sufficient empirical evidence for their validity and use. But many
teachers have found them too difficult to use in practice for
domain specification. A two dimensional table, a simplified derivation of
the schemes, has been found quite useful and acceptable by test
developers. It is suggested as follows:
Intended Lg Outcomes |  Knowledge  |  Understanding        |  Higher Order
                     |             |  (Comprehension and   |  (Analysis, Synthesis
Lg Type              |             |  Application)         |  and Evaluation)
---------------------+-------------+-----------------------+----------------------
Verbal knowledge     |             |                       |
Concepts             |             |                       |
Rules                |             |                       |
Motor skills         |             |                       |
Attitude             |             |                       |


However, the table is not obligatory; it is a suggestion and can be further
modified.


2.13 Performance Criteria: Following the specification of intended
learning outcomes, another important consideration concerns the level
of learning outcome that would be accepted as evidence of success. In subjects
where a fixed dichotomy of right and wrong exists, a performance standard in
terms of proportion or percentage, such as 85 per cent or 8 times out of 10,
may be stated as the performance criterion. In subjects where such a dichotomy
does not exist, the right answer may be determined against some authority or, in
the absence of such opinions, against the position taken by
the class teacher. In certain situations where the test developer is really stuck
and does not find a way to verbalise the accepted behaviour, statements of
examples and non-examples may be presented.
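
As a small illustration of a performance criterion stated as a proportion, the following Python sketch checks a learner's record against a standard such as 85 per cent or "8 times out of 10". The function name and the sample figures are hypothetical and are not taken from the paper.

```python
def meets_criterion(correct, attempted, standard=0.85):
    """Return True when the proportion correct reaches the stated standard."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return correct / attempted >= standard


# Hypothetical learner records
print(meets_criterion(17, 20))               # judged against 85 per cent
print(meets_criterion(8, 10, standard=0.8))  # judged against "8 times out of 10"
```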


2.14 Format Consideration: Usually, the question of format is not
important, but in certain situations it becomes significant from the point of
view of the setting of stimuli or the uniformity of responses. For example, the
recall of facts can be tested comfortably through fill in the blank or short
answer items, but the organisation of ideas or logical derivation can be tested
through nothing short of well designed essay items.


2.15 Components of domain description: The first thing that a
domain should specify is a mix of three aspects forming the stimulus
element. One particularly important segment of the stimulus element is
the statement of content which circumscribes the potential item. The second
segment delimits the type of behaviour by specifying the direction to
respond. And the third segment presents the set of situations under which
the specific behaviours are proposed to be developed and consequently
tested.


The second component of a domain description is a mix of two
aspects forming the response element. One of them is either the pattern of
right and alternative responses or the organisation of the response. The other
is the performance criterion. The major elements can be discriminated as
illustrated below:

Stimulus element                            Response element

1. Give your assessment of the
   statement by marking a tick on
   True/False (T/F) written against it.
   Food enables living things
   to obtain energy.                        T/F

2. Choose by marking a tick, the response
   that BEST completes the statement,
   out of the four choices given
   against it.
   Green plants form the primary            a. decompose food into
   source of food because they                 simpler components.
                                            b. get carbon dioxide
                                               from the.
                                            c. make their own food.
                                            d. get water from
                                               the soil.


For the purpose of domain definition, the response element becomes
particularly significant in multiple choice items. In such items it is
required not only to focus upon the right responses but also to concentrate
equally upon the distractors. In fact, it is the distractors which most
effectively communicate what is involved in a behaviour domain and what type
of discriminations are required. The wrong answer alternatives provide
information which benefits teachers. For example, a teacher who knows
what wrong discriminations learners commonly make can better prepare
learners to avoid such errors.
Finally, regarding domain specification, two guiding principles
deserve mention: one is economy of description and the other is ambiguity
reduction. While working on a domain, a middle position must be taken
between the two extremes of:

i) Sufficient detail for complete stimulus homogeneity of the resulting
test items, and
ii) Economy of resource investment.

Although a domain does not delimit all the possible test items, it
markedly reduces the ambiguity associated with the class of learner
behaviour under consideration.


2.2 Generating test items: 

A number of items might be constructed for any given objective
(domain); even a highly specific objective (well defined domain) could have
a potential item pool of well over several thousand items (Hively, 1970, 1973;
Bormuth, 1970). In terms of feasibility, a survey of the current measures
revealed that the usual practice is to use about 3-5 items per objective. This
practice, however, does not have any sound foundation in psychometric
theory or technology (Klein and Kosecoff, 1976).


The task of generating items requires a mix of the abilities of analysis
and precision of the sciences and the inventive and creative expression of the
arts. At this stage of test construction, the writer is required to invent more
and more imaginative ways of creating situations that are congruent with the
specifications of the domain definition. From the universe of natural behaviour,
or through learners' products and their responses to controlled stimuli, a
variety of situations are to be designed to tap the learner's achievement status.
The test developer is not necessarily required to confine himself to the paper
and pencil mode of presenting items.


Different types of items, such as essay type, short answer type, fill
in the blank in sentences or in diagrams, alternative response or true-false,
multiple choice in verbal and visual forms, and lastly matching and
sequencing type, may be exploited for measuring different abilities as
specified in the domain definition. But this is in no way an exhaustive list. An
inventive mind can think of many more ways of generating items which
may suit the domain definition better than the conventionally used
formats.


2.3 Improving items

The task of item improvement starts with the review of the domain
definition for its concordance with the resulting items. It is followed by
collecting empirical data on the potential of the items for producing the
desired responses. The former aspect is studied under stimulus homogeneity and
the latter under response homogeneity.


2.31 Stimulus homogeneity: The preliminary draft of the domain
definition is reviewed firstly for its concordance with the test items and
secondly for the quality of the test items. The first review may be done by an
outside expert on the subject, who may, at least on a judgemental basis, check
whether the prepared items are congruent with the domain definition. At this
stage it is likely that some ambiguities or other defects in the domain
definition will be discovered and modifications incorporated. At the second
stage, the quality of the test items is monitored. For this purpose, one or more
outside experts may review the degree to which the items possess the quality
consonant with the domain specifications. If in their judgement the items appear
to do the job, the criterion referenced test is supposed to possess stimulus
homogeneity.


2.32 Response homogeneity: Following the stimulus homogeneity
check, which is essentially an issue related to domain definition,
operational information on the test is required to assess whether most of the
items are responded to comparably by most of the learners.


The revised draft of the test items is administered to a small group
from the target population. The selection of the tryout sample deserves
special care, because selection of the wrong group will produce meaningless
results. By administering the test to those who are deficient in the desired
behaviour, a set of random responses would be obtained, which
would provide little information for improving the items. For tryout of
tests designed to tap cognitive outcomes, it is advised that learners who,
in our best estimate, have attained mastery of the desired behaviour
domain should be selected.


For improving the test items, the interpretation of tryout data is very
crucial. Some practical suggestions for diagnosing the deficiencies and
pinpointing their sources have been presented in the following paragraphs:


In the absence of widely approved techniques of processing 
response-homogeneity data, a good proportion of commonsense has to be 
employed in the improvement of test items. 


2.32 (i) In situations where a fair amount of response variance is
observed in the tryout data, the norm referenced measures of item
improvement through item analysis may be employed. For item analysis in
the context of norm referenced testing, interested readers are advised to
consult Ebel's "Measuring Educational Achievement". Item analysis in that
context involves two statistical measures: the difficulty value (DV)
and the discriminating power (DP).
The norms of DP for accepted items in the context of norm
referenced testing are not applicable in criterion referenced test situations,
because with a decreasing amount of response variance, the DPs of items
are reduced. And as an item does not disqualify itself even if it is answered
correctly by the whole tryout group, theoretically the lower limit of DP for
acceptable items reduces to zero. However, the argument does not defend
negative discriminators; a negative DP is supposed to indicate some defect
either in the setting of the items or in the domain definition. If the number of
negative discriminators is low, it appears a plausible inference that the
defect lies with the setting of the components or the language of the test items.
For rectifying such defects, the items are either modified or
eliminated. But if the number of negative discriminators is high (exceeds 30
per cent, an arbitrary number), then some domain definition defects are
likely. In such a case, the whole of the domain should be reviewed and
the suggested improvements incorporated.
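
The screening rule described above can be sketched in Python. This is only an illustration under stated assumptions: the text does not define DV and DP, so the sketch uses the common upper-group/lower-group definitions (DV as the proportion of both groups answering correctly, DP as the difference between the two groups' proportions). The function names and tryout figures are hypothetical; the 30 per cent cut-off is the arbitrary figure mentioned in the text.

```python
def item_indices(upper_correct, lower_correct, group_size):
    """Return (DV, DP) for one item from upper- and lower-group counts.

    Assumed definitions (not given in the text):
    DV = proportion of both groups answering correctly,
    DP = upper-group proportion minus lower-group proportion.
    """
    dv = (upper_correct + lower_correct) / (2 * group_size)
    dp = (upper_correct - lower_correct) / group_size
    return dv, dp


def screen_items(counts, group_size, cutoff=0.30):
    """Flag negative discriminators and suggest where the defect may lie."""
    negative = [name for name, (up, low) in counts.items()
                if item_indices(up, low, group_size)[1] < 0]
    share = len(negative) / len(counts)
    if not negative:
        advice = "no negative discriminators"
    elif share > cutoff:
        advice = "review domain definition"
    else:
        advice = "revise or drop items"
    return negative, share, advice


# Hypothetical tryout data: item -> (correct in upper group, correct in lower group)
tryout = {"item1": (10, 9), "item2": (10, 10), "item3": (7, 9), "item4": (9, 8)}
negative, share, advice = screen_items(tryout, group_size=10)
print(negative, share, advice)
```

Note that an item answered correctly by everyone (item2 here) has a DP of zero and is not flagged, in line with the argument that the lower limit of DP for acceptable items reduces to zero.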


2.32 (ii) In situations where response variance is very low, the
classical item analysis technique of norm referenced tests is not
applicable. In such instances, the main problem is to decide how homogeneous
the tryout responses to items should be. Even following the best possible domain
definition rules, it is possible to find items that vary markedly in
facility. Now the question is whether items of extremely different facility
indices are functioning so differently from the rest of the items in the test
that they represent a defect in the items or indicate a need to change the domain
definition. To pinpoint one of the two, the following suggestions may
be found useful. If the facility indices (Right-Response-Ratios) of 90-95 per
cent or more of the items fall within the limits of the
Average-Right-Response-Ratio (ARRR) ± 1 response variance, the test is supposed
to possess response homogeneity, and any remaining defect is to be sought in the
framework of the items themselves. But when the facility indices of more than
5-10 per cent of the items fall beyond the
homogeneity interval suggested above, it would always be fair to review
and modify the domain definition.
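
The homogeneity rule above can be sketched as follows. The text does not say how the "response variance" that sets the width of the interval is to be computed; this sketch assumes, purely for illustration, the population standard deviation of the facility indices. The function name and the facility indices are hypothetical.

```python
from statistics import mean, pstdev


def homogeneity_check(facility, required_share=0.90):
    """Check what share of facility indices lies within ARRR +/- one spread unit.

    Assumption: the spread is taken as the population standard deviation of
    the facility indices, standing in for the text's "response variance".
    """
    arrr = mean(facility)
    spread = pstdev(facility)
    low, high = arrr - spread, arrr + spread
    outside = [i for i, f in enumerate(facility) if not (low <= f <= high)]
    inside_share = (len(facility) - len(outside)) / len(facility)
    return arrr, (low, high), outside, inside_share >= required_share


# Hypothetical facility indices (right-response ratios) for ten items
facility = [0.90] * 9 + [0.50]
arrr, interval, outside, homogeneous = homogeneity_check(facility)
print(round(arrr, 2), outside, homogeneous)
```

Here only the last item falls outside the interval, so 90 per cent of the items lie inside it and the defect would be sought in that item rather than in the domain definition.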


2.32 (iii) One way of analysing the items of a criterion referenced test
has been suggested by Gronlund (1977). The procedure assumes that
items are of little value in measuring the intended outcomes of instruction
unless they are sensitive to instructional effects, and it takes instructional
effect as one basis for determining item quality.


To obtain this measure of item effectiveness the test developer 
must administer the test before and after instruction. Effective items will be 
answered correctly by a larger number of test takers in post-instruction-test 
than in pre-instruction-test. The index of sensitivity to instructional effect (S)
can be computed by using the following formula:
$$S = \frac{R_A - R_B}{T}$$

where $R_A$ and $R_B$ are the numbers of students answering the item
correctly after instruction and before instruction respectively, and $T$ is the
total number answering the item both times. For instance, in a class of N = 30
students, if an item was answered incorrectly by all before instruction and
correctly by all after instruction, the value of the instructional sensitivity
index is

$$S = \frac{30 - 0}{30} = 1.00$$


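Thus, the maximum value of the sensitivity index is 1.00. A value of 0.00
indicates no change between the two testings, and a negative value arises when
fewer students answer correctly after instruction than before. A larger positive
value indicates an item with greater sensitivity to the effect of instruction.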
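
The computation of the sensitivity index can be sketched in Python. The function and argument names are illustrative choices; the first call reproduces the worked example with 30 students, and the other two use hypothetical figures.

```python
def sensitivity_index(right_after, right_before, total):
    """Index of sensitivity to instructional effect: S = (RA - RB) / T."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (right_after - right_before) / total


print(sensitivity_index(30, 0, 30))   # worked example from the text
print(sensitivity_index(20, 20, 30))  # hypothetical: no change after instruction
print(sensitivity_index(12, 18, 30))  # hypothetical: fewer correct after instruction
```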


The index suffers from two serious weaknesses: firstly, a low
index may be due to either an ineffective item or
ineffective instruction, and secondly, it may be affected by external factors
also. Even so, the sensitivity index is a useful means of evaluating the
effectiveness of items in a criterion referenced mastery test.


2.32 (iv) Besides the identification of deficiencies, the more
genuine question is that of modifying the domain definition or improving the
items. In the absence of any consolidated rules or technology
for this aspect, the insight and experience of domain developers
and item writers will provide the main support structure. The components of
the domain must be reviewed for their congruence with the testing situation and
content. The items must be reviewed for the difficulty and objectivity of
language, the directions to respond, and the nature and type of distractors.


2.4 Validating the Test-items 


Validation means determining two types of measures: reliability and
validity.


2.41 Reliability: Reliability is as important a measure for criterion
referenced tests as it is for norm referenced tests. Some earlier writers on
measurement believed that all the measures of reliability which are
pertinent to norm referenced tests can also be used for criterion referenced
tests, but later developments in the area showed that there is an altogether
different way of conceptualising the issue of reliability of criterion referenced
tests. The traditional ways of computing reliability measures rely heavily
upon the presence of considerable response variance, which is minimised by
effective instructional treatment resulting in uniformly high performance.


The meaning of reliability estimates which are based on the
assumption of variance is distorted in criterion referenced tests. When
variance is very low, less sophisticated but more meaningful estimates have
been suggested by Popham (1981).


2.41 (i) One useful estimate of stability or consistency
may be obtained simply by computing the proportion of
students whose scores on two different testing occasions differed by 0-5
per cent, 6-10 per cent, 11-15 per cent, 16-20 per cent, etc. For example,
a criterion referenced test (Mehra V. 1986) was administered to 90 students
on two different occasions. The number of students falling in each
class of score difference was obtained as follows:


Number of students corresponding to different classes of
differences in scores

Difference in scores (per cent)     Number of students
0-5                                 33
6-10                                32
11-15                               14
16-20                                4
21-25                                3
26-30                                1
31-35                                2
36 and above                         1
Total                               90


The above table indicates that the great majority of the differences
between students' scores on the two occasions fall in the first two classes.
The frequency decreases rapidly in the subsequent classes. It indicates that the
students show a marked level of consistency across the score levels.
Hence, this criterion referenced test may be considered reliable for measuring
the performance of students.
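
The reading of the table can be made concrete by converting the tabulated figures to cumulative percentages. This sketch assumes that the second column of the table gives numbers of students, since the figures total 90, the size of the group tested; the variable and function names are illustrative.

```python
# Figures from the table above; assumed to be counts of students (they total 90)
classes = ["0-5", "6-10", "11-15", "16-20", "21-25", "26-30", "31-35", "36 and above"]
students = [33, 32, 14, 4, 3, 1, 2, 1]


def cumulative_percentages(counts):
    """Cumulative percentage of students up to and including each class."""
    total = sum(counts)
    running = 0
    result = []
    for count in counts:
        running += count
        result.append(round(100 * running / total, 1))
    return result


cumulative = cumulative_percentages(students)
for label, value in zip(classes, cumulative):
    print(f"{label:>12}: {value} per cent")
```

On this reading, roughly 72 per cent of the students changed by no more than 10 per cent between the two occasions.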
2.41 (ii) Another way of conceptualising the reliability of a criterion
referenced test is the educational view, according to which consistency is
viewed as the stability of educational decisions based on test scores.
Suppose, for instance, that learners scoring 85 per cent or more are assigned
to an enrichment programme and those scoring below it to a remedial programme.
A criterion referenced test is administered to the learners and the assignment
decisions based on their performance are made. Then, after an interval, the
test is re-administered, and a comparison of the assignments made on the two
testing occasions gives an estimate of the consistency of decisions. The
estimate, of course, is not a very precise and sophisticated measure, but it is
very useful for educational decisions.
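
The decision-consistency estimate described above can be sketched as the proportion of learners who receive the same assignment on both occasions. The 85 per cent cut-off comes from the example in the text; the function name and the scores are hypothetical.

```python
def decision_consistency(first_scores, second_scores, cutoff=85):
    """Proportion of learners given the same assignment on both occasions.

    A learner scoring at or above the cutoff goes to enrichment,
    otherwise to the remedial programme.
    """
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("need two non-empty score lists of equal length")
    same = sum((a >= cutoff) == (b >= cutoff)
               for a, b in zip(first_scores, second_scores))
    return same / len(first_scores)


# Hypothetical percentage scores of five learners on two occasions
first = [90, 88, 70, 84, 95]
second = [92, 80, 75, 86, 97]
print(decision_consistency(first, second))
```

In this made-up data three of the five learners receive the same assignment both times, giving a consistency of 0.6.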


The internal consistency estimate presents a different problem. As
the measure reflects the degree to which the items in a test are
intercorrelated, it displays the consistency of individual items within a test.
In criterion referenced tests this aspect is already studied, in one way, at the
response homogeneity check. Therefore, this estimate of consistency becomes a
redundant operation.


2.42 Validity: It is generally understood that a test should measure
what it is really supposed to measure, and to the extent the test does so, it
performs its responsibility. This desirable attribute of a criterion referenced
test has been conceptualised in different ways. The two more useful and popular
ones have been briefly discussed in the following paragraphs:


2.42 (i) Descriptive Validity: This is a new term used in the area of
criterion referenced measurement. The use of a new term for a concept
which overlaps partly with the content validity of norm referenced tests is
justified by a difference between the two in emphasis.
The content validity approach mainly concerns achievement testing
in a specified content area, where proper content coverage is
paramount. But present educational evaluation has been broadened to
encompass affective as well as psychomotor responses, which are not
covered by content. This is the main rationale for the new term.


Descriptive validity in a criterion referenced testing situation refers
to the degree to which the test measures the class of learner behaviour
described in the domain definition. It is determined on the basis of the
judgement of a domain expert. The job of the judge is to identify the
proportion of the items congruent with the domain description. If a good
proportion of items are found adequate with reference to the proposed
domain description, the test would be considered as possessing descriptive
validity.

Criterion referenced tests lacking in descriptive validity are of little
use to educational evaluation, for in such instances one is not sure of what
the test is measuring. Consequently, such a test cannot precisely suggest
recommendations for an instructional programme.


2.42 (ii) Domain Selection Validity: Another way of determining the
validity of a criterion referenced test is domain selection validity. It refers
to how well a selected domain of learner behaviour indicates the learner's
status with respect to a broader but less well defined instructional goal. The
issue of domain selection validity is a matter of domain generalisability over
the wider instructional goals. For example, a general goal of understanding
the words of common usage in a language may be indicated by three
different domain specifications a, b and c (say). Out of such domains, the
test constructors select one domain. But this selection is not a cafeteria-like
selection of a domain from the available class of domains. The test
constructor mentally assesses all the possible domains and rejects the others
in favour of the selected one on the basis of his personal assessment.


For determining the domain selection validity, the test constructor
should develop different tests based on the competing domain descriptions.
All the tests thus prepared are administered to a group and
scored. A set of hypothetical data, obtained by sorting out test takers
according to those who scored at the perfect level on each domain, has been
recorded in the following table:


Table: Those scoring perfect on each domain, and their scores on the
other two domains. (The entries of the table are missing here.)


It may be observed from the table that most of those who mastered 
domain "a" could also perform quite well on the other two domains. This is 
not so with other domains. For example, domain "b" appears to be much 
easier in relation to the other two domains. An individual mastering the 
domain "b" test is not all that likely to be able to generalise that mastery to 
the other domains. On the contrary, domain "a" seems to be one that, if
mastered, can tell us whether students can master the other domains as
well. It possesses high domain generalisability and hence high domain
selection validity.
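
The comparison described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
domain_generalisability, the record layout and all the scores are
hypothetical (they are not the entries of the table), and the summary used,
the average proportion score of perfect scorers on the other domains, is one
simple choice among several possible ones.

```python
from statistics import mean


def domain_generalisability(records, max_score):
    """For each domain, take the examinees who scored perfect on it and
    average their proportion scores on all the other domains."""
    domains = sorted(records[0])
    result = {}
    for domain in domains:
        masters = [r for r in records if r[domain] == max_score]
        others = [d for d in domains if d != domain]
        if not masters:
            result[domain] = None  # nobody mastered this domain
            continue
        result[domain] = mean(r[d] / max_score for r in masters for d in others)
    return result


# Hypothetical scores (out of 10) of five examinees on three domain tests.
records = [
    {"a": 10, "b": 10, "c": 9},
    {"a": 10, "b": 9, "c": 10},
    {"a": 6, "b": 10, "c": 5},
    {"a": 5, "b": 10, "c": 4},
    {"a": 7, "b": 8, "c": 10},
]

generalisability = domain_generalisability(records, max_score=10)
best_domain = max(generalisability, key=generalisability.get)
```

With these invented scores the perfect scorers on domain "a" do best on the
other two domains, so "a" would be the domain with the highest
generalisability, which mirrors the argument in the text.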


Algorithm of Development of a
Criterion-Referenced Test


D.J. Modi

ABSTRACT


Revolutionary changes have taken place in the
teaching-learning process over almost the last two decades. The
emphasis is now put on individual learning and, therefore, on
the measurement of within-individual growth.
Criterion-referenced testing (CRT) provides a description of
student achievement relative to a standard defined by the intent of
instruction in terms of skills, competencies etc. The traditional type of
testing, norm-referenced testing (NRT), and CRT have much in
common. Yet they differ in many aspects. Here, an attempt has
been made to describe the developmental procedures of
criterion-referenced testing (CRT).


INTRODUCTION 

Achievement testing plays a prominent role in all types of
instructional progress of the individual. The achievement test is a systematic
procedure for determining the amount a student has learned (through
instruction) (Gronlund, 1977, p. 1). The achievement test focuses upon an
examinee's attainments at a given point in time (Popham, 1981).


It was felt almost 15 years ago that achievement testing was in
such a mess that it became necessary to find a new system. It was clear
to many that the traditional types of achievement testing, i.e.
norm-referenced testing (NRT), were totally unsuited for assessment of what
an individual has learned. They were not designed to examine the amount of
subject matter known; they were intended to decide how well a student's
general academic performance compared with that of a group of students.
That is, they were referenced primarily to the norm group and only
secondarily to the subject area. Again, NRT does not take into account
the progressive within-individual growth (Carver, 1974).


In the traditional type of testing, as has been pointed out,
objectives were not framed so as to cover the whole universe of behaviours
connected with the subject matter/content area intended to be mastered.
Only some of them were tested. There was no objective technique for selecting
one objective rather than another. Objectives possessed overriding
authority, yet they were rarely sufficient to generate test items without some
slippage occurring between items and objectives. The item-writing process was
subjective. It rested upon the judgement, previous experience and advice of
the experts. The interpretation of test scores was referenced to
norms, i.e. the average performance of the group. But norms change when groups
change. It is in answer to such a ‘mess’ that criterion-referenced testing (CRT) is
being advocated.


Revolutionary changes have taken place in the field of teaching and
learning processes in the last two decades. Individualised instruction is
emphasised. So tests should be designed so as to reflect the
changes in what the individual has learned (McClelland, 1973). Again, as an
individual progresses through a programme of instruction, educators want to
know at various junctures precisely what the student knows and whether he
knows enough to go on.


The CRT mainly focusses on the measurement of the attainment of an
individual in terms of well-defined domain specifications; adopts rigorous
and more scientific techniques to specify subject-matter and objectives in
terms of knowledge, skills and competences (domain specifications); and sets
definite and clear rules for generating homogeneous items.
To Popham (1981, p. 26) the fundamental distinction between
NRT and CRT is based on the manner in which one interprets the
results of an examinee's test performance. In the case of NRT one
interprets someone's test performance according to the performances of
others; in the case of CRT one interprets someone's test performance in relation
to a well-defined class of knowledge, skills, attitudes and the like. In a very real
sense, interpretations are made relatively for norm-referenced tests and
absolutely for criterion-referenced tests. The foregoing presentation tries to
indicate to the reader why people favour CRT over NRT.


The purpose of this paper is to describe the steps involved in the valid
and reliable preparation of a CRT. Criterion-referenced tests and
norm-referenced tests have much in common. Yet they differ in many vital
aspects. So the reader will find some common steps carried out by adopting
different techniques and/or procedures, or the same ones with different
purposes.


DEVELOPMENT STEPS OF CRITERION-REFERENCED TESTING
Here an attempt has been made to describe a more reliable
guide for the development of CRT, without which the development and
effective usage of CRT may be hampered. While going through these steps
the reader will find that the principal concern is to obtain precise and clear
domain specifications for maximising the interpretability of an individual's
domain score.


There are in all twelve steps advocated by Hambleton (1982, 1985,
1986). Each step is briefly described one by one:


STEP 1 : PRELIMINARY CONSIDERATIONS
1.1 Specify test purposes
1.2 Identify groups to be measured
1.3 Determine the time and money available to produce the test
1.4 Identify qualified staff
1.5 Specify an initial estimate of test length.

1.1 Specifying the purpose of the test is the
foremost aspect to be thought out in test construction. The purpose of
the test will influence the appropriate breadth of the domain (Hambleton 1982, p.
395). For example, if the purpose is to provide feed-back, the coverage of all
the units becomes necessary; if the purpose is to find out causes of
recurring learning difficulties (diagnosis), the selection of content will be
based on common sources of learning errors. If the purpose is to assign
grades or certify mastery, the coverage of the content will be wider and items
will have a wide range of difficulty. Thus the purpose will decide the breadth,
depth and coverage of the content to be chosen for the test and also the
length of the test.


1.2 After pin-pointing the purpose, the next step is to decide for
whom the test is to be constructed, i.e. primary, secondary etc. The decision
about the standard will in turn help in selecting the content area of that
standard.


1.3 The development of the CRT also depends upon the availability
of time and money. The preparation of a CRT for one subject of one standard
may take a year and will need money in abundance (in lakhs). So the test
constructor has to curtail the test according to the availability of time and
money.


1.4 The services of various types of experts are a must in the development of
CRT. The services of content and evaluation experts are necessary for the review of
domain specifications and for the validation of items. It may be
noted here that the content and evaluation experts should preferably be two
in one. For the successful completion of the project in time and to facilitate the
work, one has to identify the experts well in advance.


1.5 Initially the tentative length of the test will depend upon the
purpose and the availability of time and money. The actual number of items to
be included in the test can be decided at the time of the assembly of the
test.


The careful reader must have noted that all the preliminary aspects
are inter-dependent.


STEP-2 : THE SECOND STEP CONSISTS OF:
2.1 Domain specification and
2.2 Review of objectives


2.1 Domain Specifications: 


All tests have certain specifications describing the broad range of
subject/content area and skills, competencies etc. to be assessed. Domain
specification is a new development in achievement testing (CRT) (Baker,
1974; Millman, 1974; Popham, 1975, in Hambleton 1956, p. 32). Domain
specifications clarify the intended subject/content area specified by an
objective. Such specifications attempt to reduce the uncertainty of the test
item-writer in creating comparable items, that is, items which represent the
same universe of content, skills and competencies (Baker, 1985, p. 1447).


The problem consists of describing what content the respondent
will be faced with, under what conditions and in what form the response
is expected, and also the criteria to judge the adequacy of the performance.
Thus domain specification consists of:

2.1.1 Specifying content area

2.1.2 Specifying understanding, skills, competencies to be developed/to
be mastered through the content area

2.1.3 Framing the objectives on the basis of 2.1.1 and 2.1.2

Each one is explained in brief one by one:

2.1.1 In the process of content analysis one has to divide the intended
subject/content into small topics/units and/or sub-topics, sub-units. They
should be devised in such a way that they specify a homogeneous content
area. It may be extensive or narrow (depending upon the purpose
and nature of the subject), but the content limits must be well-specified and
clear. The content limits of the domain are its (CRT's) most critical feature
(Baker, 1985, p. 1447).


2.1.2 Again those small topics/units should be transformed into
various tasks to be performed, understanding to be attained, and competencies
and skills to be developed.


They must follow the content to be mastered.

2.1.3 Framing the objectives on the basis of 2.1.1 and 2.1.2: A special
effort needs to be made to translate the content into objectives. Each
objective description must clearly define the behaviour domain based on
the content if the description is to be of help to item-writers and later to test
score users.


It should be noted here that it is easier to set the content limits of a
domain in subjects with a clear structure like mathematics, science,
grammar of a language etc., and that it becomes harder as one moves to less
clearly structured content areas like literature, art etc.


2.2 After completion of writing the objectives, i.e. the intended competency
statements, a review of the competency statements is to be made. The review
must be made for clarity and completeness, that is, one should check:

2.2.1 Whether objectives are stated clearly in terms of the performance, the
conditions and the criterion of acceptable performance. Instead of
writing objectives like


1. The student will be able to write a vowel coaliated, or


2. The students will understand how to coaliate the word, or
3. The students will know about vowel coalition,


one should write a complete statement of the objective like


The students will write the vowel coaliated word (performance) when
presented with a vowel-ending word and a word beginning with a vowel, which may
be of the same class or a different class (conditions), with correct spelling (criterion
of acceptable performance).

2.2.2 Whether all important topics/units of content are taken care of.

2.2.3 That there is no overlapping and nothing is left out.

2.2.4 That the language of the statements is unambiguous, that is, the
statements are written in simple, clear and precise language.

2.2.5 That the testing situation is appropriate for the intended
content/behaviour to be assessed.


STEP 3 : ITEM WRITING
After domain specifications are made and reviewed, items are
written to measure the domain specifications. This step involves:
3.1 Identification of a suitable item writing technique according to the
content and/or objectives to be measured.
3.2 Actual writing of the items


3.1 Item Writing Techniques
The technology of item writing aims to reduce subjectivity in item
generation. It generates the items according to specified rules. It is feared
that the rigour and precision of item writing methods are inversely related to
their practicability (Baker, 1985). Still, the objectivity, precision and
homogeneity of item writing are increased by the use of such methods. The most
notable are:
3.1.1 Items for prose learning
3.1.2 Item forms
3.1.3 Mapping sentence method/Facet Design
3.1.4 Domain based concept testing
3.1.5 LOGIQ (Logical Operations for Generating Intended Questions)
3.1.6 IQI (Instructional Quality Inventory)
Here each one is described in brief:


3.1.1 Items for prose learning

This method is based on the sentence. The technique has been
advanced for transforming important sentences occurring in prose material
into test questions (Roid and Haladyna, 1982, p. 231). From the pool of
such transformed sentences, test items are prepared to assess some of the
objectives of prose (reading) comprehension. The selection of the
sentences is made by random or stratified random sampling. This technique
was first advocated by Bormuth in 1970 and then tried by a number of other
test specialists (Roid and Haladyna, 1982, p. 231).


3.1.2 Item Forms

Perhaps this is the only technique which defines precisely the
content limits, the conditions under which the content will be presented and the
criterion for acceptable performance. This technique is more suitable for
subject-matter where facts, information and concepts are to be tested, such
as structured material in mathematics, science, grammar of a language etc.
This technique was first put forth by Hively (1974) and has been attempted
and improved by many (Roid and Haladyna, 1982, p. 232).


Item forms include
(a) A general description of what the item form is about.

(b) An item form shell which provides a sample item with directions to
examinee and examiner.

(c) Stimulus and response characteristics which describe the
theoretical characteristics of the item generation scheme and the
dimensions which are varied to comprise the replacement sets.

(d) Replacement schemes and replacement sets which detail the
exact mechanics of generating item pools for the given domain.

(e) Scoring specifications which describe the properties to be used to
differentiate correct and incorrect responses.
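
The way a shell, replacement sets and a scoring specification work together
can be illustrated with a small Python sketch. It is not Hively's notation:
the function generate_items, the addition shell, the number ranges and the
scoring rule are all hypothetical and chosen only to show how an item pool
follows mechanically from an item form.

```python
from itertools import product


def generate_items(shell, replacement_sets, scoring_rule):
    """Fill every slot of the shell with every combination of values from
    the replacement sets and attach the scoring key to each item."""
    names = list(replacement_sets)
    pool = []
    for values in product(*(replacement_sets[name] for name in names)):
        slots = dict(zip(names, values))
        pool.append({
            "stem": shell.format(**slots),   # the item shown to the examinee
            "key": scoring_rule(**slots),    # the response scored as correct
        })
    return pool


# Hypothetical item form for a two-number addition domain.
item_pool = generate_items(
    shell="Add {x} and {y}.",
    replacement_sets={"x": range(10, 13), "y": range(1, 3)},
    scoring_rule=lambda x, y: x + y,
)
```

Here three values of x and two values of y give a pool of six items, each of
which lies inside the content limits fixed by the replacement sets.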


3.1.3 Mapping-Sentence Method

It is also called Facet Theory. It provides a structure and boundaries
of testing conditions based on an analysis of the structure of the subject matter
content. This method is mainly used for generating items for social studies;
still, it is found that the method works equally well with more abstract material
also. This method is advocated by Engel and Martuza (1976).


3.1.4 Domain-Based Concept Testing

This method is not only a testing technique but also a system for defining
concepts and teaching them (Roid and Haladyna, 1982, p. 232). Thus
it co-ordinates teaching with testing. It helps in analysing concepts in
skill-related, job-related or typical work-related training areas. Concept
analysis is of great value in education. This method was put forth by
Tiemann and Markle in 1976.


3.1.5 LOGIQ (Logical Operations for Generating Intended Questions)

Another typology, given by Williams and Haladyna, is
LOGIQ, which is concerned with the construction of higher-level test items
and provides rules for matching syntactical forms with objectives at various
cognitive levels (Harman 1985). Content, tasks and response models are put
in a three-dimensional matrix, and objectives and test items are classified according
to the matrix. After selecting the content and task, the item writer can use the matrix
to determine how to construct an appropriate test item.


3.1.6 IQI (Instructional Quality Inventory)

This technique was developed in the United States for military training.
The technique advocates a content-by-task matrix and is concerned with
objective-test consistency and adequacy.


As the nature of various content areas varies, there will be a
need for various types of techniques to assess the varied skills and
competences related to various subject matters. All these technologies have
provided a potential, theoretical but relevant description of content which is
testable and understandable. More efforts are needed to find easier and
more practical ways of item generation with a scientific approach.


3.2 Item Writing

The domain interpretation is possible only when a universe of
items has been created and the test is based on a sample from the domain.
This type of interpretation is the heart of CRT. The term "item-universe" has
been used typically in the literature to connote a large collection of test items
involving cognitive abilities (Shoemaker, 1975). Usually item generation
with highly sophisticated technologies will generate a large number of items.


So, actual item writing for CRT should be taken up after choosing
a suitable item generation technique and a suitable item format. The quality
of test items should be uniformly high. Items so generated will be of various
difficulty levels.


Principles of item writing used in NRT construction apply to that of
CRT as well. Still, items should be written strictly according to
the domain specifications.


The editing of items increases the quality of the test items. So editing
should be done after item writing.


STEP-4 : ASSESSMENT OF CONTENT RELEVANCE OF ITEMS

It is rightly pointed out by Popham (1981, p. 108) that CRT
developers should give much more systematic and intense attention to this
important but characteristically under-emphasized form of validity (content
relevance).


Messick (1975, pp. 960-961) has suggested using not the term
content validity but content relevance or content representativeness,
because validity is based on test scores. Hence the question
is to decide whether the items in a domain represent the knowledge, skills and
competences specified in a particular domain or not. So the assessment of
content relevance is of utmost importance in CRT. The content relevance is
assessed through:

4.1 Logical review by subject and evaluation experts (Step 4) 


4.2 Empirical review based on field test (Step 6)


4.1 Logical Review 


The proposed logical item review is a process in which items are
carefully scrutinized by experts to ensure that the test is characterised by the
important quality of item-objective consistency (Roid and Haladyna, 1982,
pp. 203-213). This consistency is judged by subject and evaluation experts,
preferably two in one. Such judges can decide the appropriateness of the
specified knowledge, skills and competencies of the intended content area and the
consistency of the items in relation to these domain specifications, and also
review the items for their biases and stereotyping.

For logical review of the items the following steps should be carried out:


4.1.1 Prepare a descriptive information sheet about the test items:
elaborate the behavioural specifications, specify the content area to be
measured and give such other information as is needed in item development.


4.1.2 Locate a group of five individuals who are sufficiently familiar
with the subject matter and evaluation procedures. Hand over the
constructed items to them to prepare the description of the items as
described in 4.1.1.


4.1.3 Then give the descriptive information sheet to the subject and
evaluation experts to judge the accuracy of the content, the adequacy of the
content coverage and also the accuracy of the behavioural specifications.


4.1.4 Locate another group of five experts and ask them to indicate,
from the pool of items, which items are congruent with the objectives given in
the description. The index of item-objective congruence will be found
on the basis of the ratings of the judges.
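
One simple way of turning the judges' ratings into an index is sketched
below in Python. It is an assumption made for illustration, not the index
prescribed by the sources cited here: each of five judges marks an item 1
(congruent) or 0 (not congruent), the index is the proportion of judges
marking it congruent, and the minimum acceptable value of 0.8 is
hypothetical, as are the ratings themselves.

```python
def congruence_index(ratings):
    """ratings maps an item id to the list of 0/1 judgements, one per judge.
    The index is the proportion of judges who found the item congruent."""
    return {item: sum(votes) / len(votes) for item, votes in ratings.items()}


def items_needing_revision(index, minimum=0.8):
    """Items whose index falls below the (hypothetical) minimum value."""
    return sorted(item for item, value in index.items() if value < minimum)


# Hypothetical judgements of five experts on three items.
ratings = {
    "item1": [1, 1, 1, 1, 1],
    "item2": [1, 1, 0, 1, 0],
    "item3": [1, 1, 1, 1, 0],
}

index = congruence_index(ratings)
flagged = items_needing_revision(index)
```

Items flagged in this way would be taken up for revision in Step 5.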


Thus checking the content relevance of each item is perhaps the most
crucial of the validity approaches which should be established for a good CRT
(Popham, 1981, p. 107).


STEP 5 : REVISION OF TEST ITEMS

Revision of the test items on the basis of Step 4 is necessary.


STEP 6 : EMPIRICAL REVIEW 

The field testing of items and item analysis have always been a source of
ambivalence among advocates of CRT. Perfect congruence is expected
between objectives and items. By adopting a highly sophisticated technology of
item generation, there is the least possibility of having items which function
less than satisfactorily. Still, experience indicates that there will
always be some subjective element in the formation of objectives and therefore
in the production of items. Therefore, there is room for some kind of item
analysis.


The aim of empirical review in CRT is to find out items which are not
sensitive to instruction, i.e. which do not depict the changes within
individuals. The best items, in this view, are those which have P values
approaching zero prior to instruction and P values approaching 1 (one)
subsequent to instruction. If it is carried out at all, it is for detecting flaws in
items in terms of distracters, stems, language and sex or culture biases. At
this stage the test items are organised into booklet form and given to an
appropriate group of examinees. There are three statistical methods given
by Berk (1984):

1. Pre-instruction post-instruction measurements method (PPDI)

2. Uninstructed-instructed groups approach

3. Contrasting groups approach

Each of the above methods has its merits and demerits. The first
one has been in use for several years. Cox and Vargas (1966) (Roid and Haladyna,
1982, pp. 218-220) prepared the pretest-posttest difference index for
measuring gains between a pretest given before instruction and a post test
given after instruction, as a measure of instructional sensitivity. One can refer to Berk
(1984) for details of the three approaches.
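
The pretest-posttest difference index can be computed as the P value of an
item after instruction minus its P value before instruction. The Python
sketch below shows the arithmetic; the function names and the response data
(1 = correct, 0 = incorrect, five examinees) are hypothetical.

```python
def p_value(responses):
    """Proportion of examinees answering the item correctly."""
    return sum(responses) / len(responses)


def ppdi(pre_responses, post_responses):
    """Pretest-posttest difference index of a single item."""
    return p_value(post_responses) - p_value(pre_responses)


def item_sensitivity(pre, post):
    """PPDI of every item; pre and post map item ids to 0/1 responses."""
    return {item: ppdi(pre[item], post[item]) for item in pre}


# Hypothetical responses of five examinees before and after instruction.
pre = {"item1": [0, 0, 1, 0, 0], "item2": [1, 1, 1, 0, 1]}
post = {"item1": [1, 1, 1, 1, 0], "item2": [1, 1, 1, 1, 0]}

sensitivity = item_sensitivity(pre, post)
```

In this invented example item1 moves from a P value of 0.2 to 0.8 and so
appears sensitive to instruction, whereas item2 stays at 0.8 and would be
examined for flaws.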


STEP 7: REVISION OF TEST ITEMS 

Again, revision of test items in view of the item analysis, if necessary,
should be carried out to remove any flaw in any item. Usually, test items are
not deleted but rid of their flaws.


STEP 8: TEST ASSEMBLY 
At the stage of test assembly decisions about the following aspects
should be taken:
8.1 Determination of test-length
8.2 Preparation of parallel forms
8.3 Selection of test-items from the pool
8.4 Preparation of test directions, practice questions, test booklet, lay-out,
scoring keys, answer sheets etc.


8.1 Determination of test length 


The determination of test length means determining the number of
test items measuring each objective to include in a test, that is, deciding
how many test items to sample and how they are to be sampled.


Eight items represent a sufficient basis on which to assess student
mastery or to make instructional decisions from CRT data (Hambleton,
Hutten and Swaminathan 1976). That is, a minimum of eight items should be
sampled from the domain. The choice of the number eight is based on the
Bayesian estimate of the domain score. The maximum number of items tried
was twenty, but the increase in the number of items improved both the goodness of
fit measures only to a modest extent.


For further details one can refer to Hambleton (1984), wherein
other methods, (1) Millman's Binomial Test Model, (2) Novick and Lewis'
Bayesian approach, (3) Wilcox's Indifference Zone, (4) Computer Simulation
Methods (Eignor & Hambleton 1974) and (5) methods based on the use of Item
Response Theory, are discussed in detail.
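
The reasoning behind a binomial approach to test length can be sketched in
Python as follows. This is a bare illustration of the binomial model and not
a reproduction of Millman's published procedure: it assumes that each of the
n items sampled from the domain is answered correctly with probability equal
to the examinee's true domain score, and the cut-off of 6 out of 8 and the
true scores of 0.9 and 0.5 are hypothetical.

```python
from math import comb


def prob_at_least(n_items, cut, true_score):
    """Probability that an examinee whose true domain score is true_score
    answers at least cut of n_items items correctly (binomial model)."""
    return sum(
        comb(n_items, k) * true_score**k * (1 - true_score) ** (n_items - k)
        for k in range(cut, n_items + 1)
    )


# Hypothetical case: 8 items, cut-off 6 correct.
pass_true_master = prob_at_least(8, 6, 0.9)      # a master correctly passed
pass_non_master = prob_at_least(8, 6, 0.5)       # a non-master wrongly passed
```

Repeating the calculation for different numbers of items shows how the two
kinds of misclassification shrink as the test is lengthened, which is the
basis on which a test length may be chosen.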


8.2 Preparation of parallel forms 


The importance of having parallel forms of a test in an instructional
programme lies in the re-testing of students whenever necessary. It will be more
appropriate to test them with a parallel form than with the initial test.


8.3 Selection of test items 

A parallel form is prepared by sampling items from the domain or
sub-domain through either a random or a stratified random sampling plan. In a
stratified random sampling plan, the items in the domain or sub-domain are
divided initially into strata by difficulty level, content area, content and
difficulty level together, or instructional unit, and a specified proportion of items is
sampled randomly from within each stratum (Shoemaker, 1975, p. 137).
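
A stratified random sampling plan of this kind can be sketched in Python as
below. The function draw_parallel_forms, the strata "easy" and "hard", the
item labels and the numbers drawn per stratum are hypothetical; the sketch
also assumes that the forms should not share items, which the text does not
require.

```python
import random


def draw_parallel_forms(item_pool, per_stratum, n_forms=2, seed=0):
    """Draw n_forms non-overlapping forms. From each stratum the same
    number of items is sampled at random for every form."""
    rng = random.Random(seed)  # fixed seed so that the draw can be repeated
    forms = [[] for _ in range(n_forms)]
    for stratum, items in sorted(item_pool.items()):
        size = per_stratum[stratum]
        if size * n_forms > len(items):
            raise ValueError(f"not enough items in stratum {stratum!r}")
        chosen = rng.sample(items, size * n_forms)
        for i, form in enumerate(forms):
            form.extend(chosen[i * size:(i + 1) * size])
    return forms


# Hypothetical item pool stratified by difficulty level.
pool = {
    "easy": ["E1", "E2", "E3", "E4", "E5", "E6"],
    "hard": ["H1", "H2", "H3", "H4"],
}
per_stratum = {"easy": 2, "hard": 1}

forms = draw_parallel_forms(pool, per_stratum, n_forms=2, seed=1)
```

Each form then contains two easy items and one hard item, so the forms are
parallel with respect to the strata chosen.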


The assessment of item representativeness should also be done by the
item reviewers. This can be done by asking, 'How well does the set of items
sample the domain of content and/or behaviour relevant to the defined objective?'
Their opinion will be given by rating each item on a five-point scale.


If the representativeness is not up to some desired level, then again
the revision of test items becomes necessary in the light of the judgement of the
reviewers.


8.4 The preparation of directions etc. mentioned in this
section is to be completed before the administration of the test. The procedures are
well specified (say 'standardized') in NRT and are equally
applicable to CRT. All these aspects should be carried out with utmost care
and accuracy, so that the tasks to be performed at testing time are clear and
well understood by testees. Sufficient care should be taken to ensure that
copies of answer-sheets and question papers are legible and free from
typographical errors. A sufficient number of practice questions should be added
to each sub-domain test. Two parallel forms should be prepared at this
stage. Scoring keys and the lay-out of the answer sheets should also be
prepared.


STEP 9 : SELECTION OF STANDARD

More than 20 different methods for setting performance standards
(cut-off scores) have been recommended in the literature (Berk 1984).
Standard setting is necessary to assign examinees to 'mastery' or 'non-mastery'
status, that is, to decide whether an examinee has enough of the
knowledge, skills and competences specified in the domain to go ahead or not. This
is considered to be the most difficult problem in criterion referenced testing
(Hambleton, 1985, p. 414). So, the test constructor at this step should find
persons who know the content well and are aware of the standard
setting method to be used in the judgemental process.


Berk (1985, pp. 1116-1117) divides these standard setting methods
into two categories based on their assumptions about the acquisition of the
underlying trait or ability, and again classifies them by the processes to be
carried out: judgemental, empirical or both. He thus lists four models/methods:

9.1 State Models
9.2 Continuum Models
9.3 Judgemental Methods
9.4 Judgemental-Empirical Methods

Berk (1976) has given a detailed description and computation of an
empirical method of setting the standard. That method is preferable because it is
an empirical one.


STEP 10: PILOT TEST ADMINISTRATION 
This step consists of following sub steps : 


10.1 Designing the administration to collect score reliability and validity 
information. 

10.2 Administering the test forms to appropriately chosen groups of
examinees.

10.3 Evaluating the test administration procedures, test items and 
Score reliability and validity 

10.4 Making final revisions based on data from (10.3) 

10.1 Designing the administration of the test will consist of the
following aspects :

1. Locating and providing a place (room) with proper working
conditions at the time of administering the test.

2. Keeping interruptions to a minimum

3. Arranging enough space between testees to prevent cheating

4. Arranging for a black board, stop watch and such other things useful
for administration.


In sum, an attempt should be made to provide testees with the most
favourable conditions.


10.2 Then the test should be administered to the appropriately
chosen group of examinees, with an appropriate gap, before and after
instruction.

10.3 The pilot administration of the test will provide the responses
of the students. From the scores derived from the pretest and post test
administration of parallel forms, the reliability and the validity of the test scores
should be assessed. The standard (cut-off score) is to be set at this point,
the decision about mastery/non-mastery is to be taken and the gain per
individual is to be found out.
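
The bookkeeping described in 10.3 can be sketched in Python as follows. The
names classify and pilot_summary, the cut-off score of 6 (out of 8 items) and
the three examinees' scores are hypothetical; the sketch assumes that a score
equal to the cut-off counts as mastery.

```python
def classify(score, cut_off):
    """Mastery status implied by the cut-off score."""
    return "master" if score >= cut_off else "non-master"


def pilot_summary(pre, post, cut_off):
    """Gain and mastery status before and after instruction, per examinee."""
    summary = {}
    for student in pre:
        summary[student] = {
            "gain": post[student] - pre[student],
            "pre_status": classify(pre[student], cut_off),
            "post_status": classify(post[student], cut_off),
        }
    return summary


# Hypothetical scores on parallel forms of an 8-item sub-domain test.
pre_scores = {"S1": 2, "S2": 5, "S3": 6}
post_scores = {"S1": 7, "S2": 5, "S3": 8}

summary = pilot_summary(pre_scores, post_scores, cut_off=6)
```

In this invented case S1 gains five points and crosses the cut-off, S2 shows
no gain and remains a non-master, and S3 was already a master before
instruction.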


Again, the reliability and validity of the decisions mentioned in the
foregoing para are to be assessed by choosing appropriate validity methods
and an appropriate category and index of reliability.


10.4 If any deficiency is found in the procedures of administration of
the test after evaluating the procedures of this stage, changes should be made
accordingly.


STEP 11 : ADDITIONAL TECHNICAL DATA COLLECTION
11.1 Validity
11.2 Reliability
11.1 Validity : The last but important data for CRT concern the
validity and reliability of the test. Although validity is considered to be an
important component of any test, criterion referenced test score validity
has been paid little attention by researchers (Hambleton, 1985, p. 417).


The validation procedure in CRT is different from that of NRT. It has
been argued that 'content validity' (IOC procedure) is a sufficient measure
for CRT. But Messick (1975, p. 956) very rightly points out that content
coverage is an important consideration in test construction and
interpretation, to be sure, but in itself it does not provide validity (of the test).


Hambleton (1985, pp. 203-205) has listed a large number of methods
classified into five major types of validity methods. They are :

11.1.1 Intra-objective methods include item analysis, evaluation of
test content and score reliability.

11.1.2 Inter-objective methods include what are often called
‘convergent’ and ‘divergent’ validity studies.

11.1.3 Criterion-related studies include prediction studies and other
studies of the relationships between test scores and mastery
classifications and independent measures of performance.

11.1.4 Experimental methods include the determination of the
sensitivity of test content to instruction.

11.1.5 Multi-trait multi-method studies address what it is that a test
actually measures.

Criterion-referenced scores are commonly used to make decisions.
Decision validity involves (a) setting a standard of test performance and (b)
comparing the test performance of two or more criterion groups relative to
the specified standard.


Some of the validation procedures have already been mentioned in 
earlier sections of this paper. They are: 


(1) Item-objective congruence (content validity) 

(2) Item analysis (item validity) 

(3) Pre-test post-test difference Index (PPDI) 
(Criterion related validity) 

(4) Setting a standard test performance (cut-off scores)


11.2 Reliability: 

The reliability is concerned with the consistency of a test 
measurement over a time. The assessment procedure of consistency of 
decisions of CRT are totally different from NRT. Hambleton et al (1978, р. 
1523) listed three major categories of reliability- 


11.2.1 Reliability of mastery classification decisions, wherein it is to
be determined whether the mastery/non-mastery decisions about individuals
are consistent.


11.2.2 Reliability of criterion-referenced test scores, wherein it is to
be determined whether the squared deviations of individual scores from the
cut-off score are consistent.


11.2.3 Reliability of domain score estimates, wherein it is to be
determined whether the individual scores attained in a domain are
consistently attained on parallel forms.


Again, 13 reliability indices for criterion-referenced tests are
identified and grouped into three categories by Berk (1984, p. 202):
(a) Threshold loss function
(b) Squared error loss function
(c) Domain score estimation.

The criterion-referenced test developer will have to choose, during the
development process:

1. An appropriate category of reliability
2. A specific index within that category.

There are five different approaches for estimating the reliability of
mastery/non-mastery classification, each named after the person who
advocated it. They are discussed by Subkoviak (1984, pp. 267-291). A separate
chapter is written by Brennan (1984, pp. 292-334). That chapter deals with
the reliability (dependability) of criterion-referenced test scores and, to some
extent, the reliability of domain score estimates. Interested persons can go
through them.
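
As an illustration of the threshold loss category, the sketch below computes two commonly used agreement indices for mastery/non-mastery classification on two parallel forms: the proportion of consistent decisions (p0) and the kappa coefficient, which corrects p0 for chance agreement. The function names, the scores and the cut-off of 7 are hypothetical and chosen only for illustration; this is not the procedure of any particular author cited above.

```python
def classify(scores, cutoff):
    """Return True (master) or False (non-master) for each score."""
    return [s >= cutoff for s in scores]


def agreement_indices(form_a, form_b, cutoff):
    """Proportion of consistent decisions (p0) and kappa for two parallel forms."""
    a = classify(form_a, cutoff)
    b = classify(form_b, cutoff)
    n = len(a)
    # p0: share of examinees given the same decision on both forms
    p0 = sum(x == y for x, y in zip(a, b)) / n
    # pc: agreement expected by chance from the marginal mastery rates
    pa, pb = sum(a) / n, sum(b) / n
    pc = pa * pb + (1 - pa) * (1 - pb)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else float("nan")
    return p0, kappa


# Hypothetical scores of eight pupils on two parallel 10-item forms
form_a = [8, 9, 5, 4, 7, 3, 9, 6]
form_b = [9, 8, 6, 3, 6, 4, 8, 7]
p0, kappa = agreement_indices(form_a, form_b, cutoff=7)
print(p0, kappa)
```

With these figures six of the eight pupils receive the same decision on both forms; kappa is lower than p0 because half of that agreement would be expected by chance alone.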


STEP 12 : PREPARATION OF MANUALS 
The CRT developer must provide information material in the form of
manuals to aid qualified users in administering the test and interpreting its
results. It must include the following:
12.1 A test administrators' manual
12.2 A technical data manual.


12.1 The manual for test administrators should include:

12.1.1 Specific administrative guidelines.

12.1.2 Specific expertise or training required for administration
of the test.

12.1.3 The oral directions to be given to the testees at the time of
testing (in written form).

12.1.4 Directions regarding distribution of test booklets, filling in of
general information, and setting of the room at the time of test
administration.

12.1.5 How to deal with practice exercises.

12.1.6 Scoring procedures.

12.1.7 Nature and purposes of the test.


12.2 The criterion-referenced test manual should also include:

12.2.1 Details and justification about the analysis of content and/or
preparation of objectives (domain specifications).

12.2.2 Item writing technology adopted.

12.2.3 Names and qualifications of the item writers and reviewers,
along with a note on the processes adopted for review of items.

12.2.4 The technical data about items of the test, domain-wise.

12.2.5 The item selection procedure used.

12.2.6 Details about the number, age group, sex and standard of the
examinees.

12.2.7 The data about validity and reliability procedures adopted,
with their indices.

It is felt that the procedures for the development of CRTs are
being carried out with a more and more scientific approach at all
developmental steps. Still, for the development of good CRTs in India,
much remains to be done.


Domain Description For Criterion
Referenced Testing


Pritam Singh 
ABSTRACT 


The definition of content for the purpose of CRT requires the
identification of the domains. Discussing the definition of the
parameters of domains, the paper identifies two dimensions of
domain description: the knowledge segment and the performance
criterion in every domain. Accordingly, content elements and
corresponding intended learning outcomes are used as domain
descriptors. The criteria for identification of domains and the
conditions needed for their specification are analysed in the light
of CR testing, which is used mostly as a progress diagnostic test. A
model based on behavioural outcomes and intended criterion
levels is proposed using the idea of a hierarchy of abilities and
interdependent organisation of content. Three models of domain
description, the single act, close domain and open domain, are
illustrated. It is proposed that item validity can be established if
the domain is described easily, unambiguously and discretely.


1. INTRODUCTION 


Criterion-referenced testing, though a promising field of evaluation,
especially for its focus on improvement of teaching and learning, has yet to
make headway in our classrooms. However, the fact remains that of late
the need for criterion-referenced measures is increasingly appreciated by
evaluators who consider the diagnostic function of evaluation more important
than the judgemental function in the teaching-learning process. These
progress diagnostic tests have their niche in the warp and woof of a content
area which is not only specified clearly but also delimited in the nature and
scope of the content elements it covers and the outcomes of learning
intended as a sequel to learning of a predetermined segment of content. This
potential content of a topic, which forms a clearly defined segment of
knowledge that lends itself to the attainment of pre-determined expected
outcomes of learning if taught properly, may be called a 'domain'. Of course,
there are diverse opinions about the nature of a domain and accordingly
different connotations are given to this word in relation to criterion-referenced
tests.


2. CONCEPT OF DOMAIN


Literature on criterion-referenced measurement shows a great
diversity in the use of the term 'domain'. According to Dockrel (1975), 'a
behavioural domain is one which can be detected by an observer by virtue
of the pupils being able to do something new as a result of learning'. Since
performance criteria are inherent in the concept of criterion-referenced
measurement, specification and organisation of relevant behaviours is an
important task of test development. Hively (1974) refers to it as 'domain
referenced measurement or theory of performance' and emphasises careful
definition of the domain of relevant behaviours associated with an area of
knowledge, which would later on be used to write test items for that domain.
Thus, according to Hively, the concept of domain includes both the specific
content area and the behaviours associated with this content.


Cronbach (1977) refers to the term 'universe specification' with a
focus on skills and lists the situations in which observations are to be made;
observations provide a valid sample of this universe. Ebel (1962) uses the
term 'standard domain of content' with a focus on content that provides the
basis for test construction. Baker (1981) uses 'behavioural classes',
universe of content and prescribed instructional outcomes. Nitko (1980) uses
the terminology of 'standard domain of content', domain or universe
specification in relation to behavioural objectives.


Popham (1975) prefers to call it domain-referenced testing, in relation
to content general objectives and simplified objectives. Skager (1975) used
performance or content domain and performance objectives or behavioural
output. Wilson (/) used domain of influence and universe of behaviours
relating to overall major and sub-objectives.


From the above connotations it is quite clear that different authors
use different terminology, which makes the field appear rather confusing and
overlapping. Nevertheless, the two underlying dimensions that constitute the
bases of a domain are the content element or segment of knowledge and
the performance criterion or objective, in the form of intended outcomes
of learning. Accordingly, we may use content element and intended
learning outcomes as the two facets of the same segment of knowledge,
hereinafter called the domain descriptors, which later form the basis of
test construction.


3. DOMAIN IDENTIFICATION 

Once we have appreciated the need to delineate the content and
behaviours in the form of desired or acceptable criteria of performance with
reference to a domain or segment of knowledge, our next concern is to
delineate the domains in a particular topic or unit of teaching. How many
domains could be identified from a particular unit, or how big or small a
domain would be, depends on a number of factors like the following:


a) Total time allocation to the unit in the instructional plan.

b) How much time is available or earmarked for testing the students
on that unit?

c) How homogeneous or heterogeneous are the chunks of content that
can be divided into convenient integrated sub-units or sections
which could become the basis for testing?

d) How many concepts or other content elements are spread over
a given unit of learning? The higher the density of new
elements per unit of teaching, the more could be the number of
domains in that unit.

e) To what extent are the content elements in a unit amenable to
developing intended learning outcomes? Higher level content
elements like principles and concepts are likely to generate more
intended outcomes than content elements like terms and facts.

f) To what extent is the developer prepared to sacrifice the 'sufficiency'
aspect at the altar of brevity?


In case one attempts to cover all content elements separately with
a view to covering all intended outcomes of learning, the list becomes
too unwieldy to handle within the available time. Let us assume
that in a given unit of teaching there are, say, 30 facts, 5 new
terms, 10 concepts and 2 principles or generalisations, and that there
are 2, 2, 5 and 5 intended learning outcomes respectively for each of
these content elements. Facts, terms and concepts alone would lead to
(30 x 2) + (5 x 2) + (10 x 5), i.e. 120 I.L.Os, and the two principles
would add a further 2 x 5 = 10. In other words the unit must have at
least four domains with 30 I.L.Os each, vis-a-vis a minimum of 30 test
items per domain, requiring one period each, or two periods if two parallel
forms are to be prepared. It would not be practicable within the time
frame of evaluation to delineate and handle this number of
domains, which would require more testing time than is feasible in the
classroom.
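
The arithmetic of this example can be checked with a short sketch. The counts of content elements and of I.L.Os per element are those given in the text; the figure of 30 I.L.Os per domain is the text's own working assumption, and the function names are hypothetical. The text's total of 120 counts facts, terms and concepts only; the sketch also shows the total when the two principles are included.

```python
import math

# (number of content elements, I.L.Os per element), as given in the example
UNIT = {
    "facts": (30, 2),
    "terms": (5, 2),
    "concepts": (10, 5),
    "principles": (2, 5),
}


def ilo_count(elements):
    """Total intended learning outcomes over the listed content elements."""
    return sum(n * per_element for n, per_element in elements.values())


def domains_needed(total_ilos, ilos_per_domain=30):
    """Smallest number of domains if each holds at most ilos_per_domain I.L.Os."""
    return math.ceil(total_ilos / ilos_per_domain)


counted_in_text = {k: v for k, v in UNIT.items() if k != "principles"}
print(ilo_count(counted_in_text), domains_needed(ilo_count(counted_in_text)))
print(ilo_count(UNIT), domains_needed(ilo_count(UNIT)))
```

Either way the unit needs at least four testing periods for a single form, which is the practical difficulty the example is meant to show.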


On the other hand, if we restrict content elements to only a very few
major concepts we might be left with only a single domain, which
would look comprehensive enough to cover the unit but may ignore
content elements which otherwise would have been useful from the
point of view of diagnosis. Since criterion-referenced tests are indeed
progress diagnostic tests, formulation of global domains would not
serve this purpose and would lead to development of
traditional unit tests rather than criterion-referenced tests. Thus the size
of a domain should be determined keeping a balance between
'brevity' in description and 'sufficiency' in coverage of content
elements through corresponding I.L.Os.


4. DOMAIN DESCRIPTORS

From a practical point of view we may use concepts as the
most usable form of content element as one of the descriptors. Concepts
may be identified in the domain and arranged in order of hierarchy,
complexity, development or sequence suitable for instructional
purposes. This may be followed by the corresponding learning outcomes
intended as a product of learning of those concepts. Depending upon the
nature and scope of a concept, the list of intended learning outcomes in
relation to knowledge, understanding and application objectives (NCERT
taxonomy) may be arranged in taxonomic order. It is, however,
essential that these I.L.Os be listed in terms of specific behaviours
indicating the various mental processes implied under the three given
categories of objectives. Thus the I.L.Os should be stated:


a. in terms of the pupil's terminal behaviours,
b. using action words like recall, interpret, predict, and
c. reflecting the content element being tested.


It is quite logical to say that lower level I.L.Os like those of recall,
recognise or identify are attainable through every type of content element,
whether it is a simple term, fact, concept, principle, process or a higher order
generalisation. However, this is not the case with higher level I.L.Os like those
of the learner's ability to analyse, hypothesise, reason, predict or judge. Such
outcomes are attained more easily through higher order content elements
like major concepts or principles. This provides us the hint for domain
description.


While formulating I.L.Os corresponding to the various content elements
in a given domain, one should try to associate higher order I.L.Os with higher
level content elements and should not list lower order I.L.Os against higher
order content elements. This would help to avoid unnecessary length of the
test later on, and also prevent over-emphasis on lower level abilities.
The following graph indicates the usual pattern expected in a good
description of a domain, assuming, say, four content elements and four abilities.

Therefore, the general pattern of combination to develop I.L.Os will be
(TR, FR, CR, PR) + (FI, CI, PI) + (CA, PA) + (PP). It is, however, not
obligatory to develop I.L.Os strictly in accordance with this
framework. What is intended is that from this graph one can visualise the
approach to developing a table of domain descriptors which can be made
the basis of specification of a question-wise grid of C.R.Ts.


[Figure: grid of content elements (Term, Fact, Concept, Principle) against abilities: Recall (R), Interpret (I), Analyse (A), Predict (P)]
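
The pattern of combinations can be generated mechanically, as in the sketch below. It assumes that the four content elements are ordered Term, Fact, Concept, Principle and the four abilities Recall, Interpret, Analyse, Predict, and that an ability is paired only with content elements at or above its own position in that order; the function name is hypothetical.

```python
ELEMENTS = ["T", "F", "C", "P"]   # Term, Fact, Concept, Principle
ABILITIES = ["R", "I", "A", "P"]  # Recall, Interpret, Analyse, Predict


def descriptor_pattern(elements, abilities):
    """Pair the j-th ability with the j-th and all higher content elements."""
    return [[e + ability for e in elements[j:]]
            for j, ability in enumerate(abilities)]


pattern = descriptor_pattern(ELEMENTS, ABILITIES)
print(" + ".join("(" + ", ".join(group) + ")" for group in pattern))
```

Each code is a content element followed by an ability, so 'CA' reads as an I.L.O. that asks the pupil to analyse using a concept.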


5. LEVELS OF DOMAIN DESCRIPTION 

As pointed out earlier, a domain description should not be unwieldy in
length. To avoid this trivialisation it is preferable to prioritise the testing
elements, leave out micro-elements like terms and facts, and pinpoint
the concepts which form the warp and woof of the content structure. Here
the underlying assumption is that simple factual information is not so
important to test as the concepts or the principles identified in the domain.
Secondly, testing for a concept automatically takes care of the facts and
terms which are precursors of the development of that concept. This approach is
economical and at the same time takes care of the significant content
elements which are crucial for developing the desired abilities to be listed as
intended learning outcomes. What is indeed more important is that there
should be no gaps in the hierarchy or sequencing of concepts on the one
hand and the coverage of intended learning outcomes on the other. An
illustration of a domain description in one unit of learning is shown in
Annexure A, which gives the list of concepts and the corresponding
I.L.Os in the domain on which a CRT was developed in one of the projects
undertaken in the department.


It may be pointed out here that it is neither possible nor desirable to
cover every ability implied under the main objectives of knowledge,
understanding and application in a given domain. Much depends on the nature
and scope of the textual content, and on whether it is possible to use its content
elements as a vehicle for developing all those abilities. Nevertheless, the guiding
principle is to list all those I.L.Os which a good teacher can strive for in
teaching or developing the concepts identified in a domain.


6. VALIDATING THE DOMAIN DESCRIPTION 


Depending upon their scope, the three types of domains identified by
Dockrel (1980) require separate treatment for validation purposes. In the case of
a Single Act Domain only one discrete phenomenon is tested by a direct
question.
Example: The pupil will be able to recall the chemical formula of
Sodium Hydroxide. (I.L.O.)

In this case only one question, of any type or format, can
be asked to test this discrete single phenomenon. Therefore, complete
agreement between the act of measurement and the attribute of the domain is
achieved. It leads to precise measurement, thereby ensuring the maximum
possible logical contiguity between the domain attribute and the item used
for its measurement.


In the case of a Close Domain one may have 30 chemical formulae in a
unit, which can be tested by putting one question on each. Usually, however,
a sample of 8, 10 or 15 chemical formulae may be tested and an inference
made on the basis of that sample. Validity with respect to the domain would
depend on how fairly the sample represents the domain description and how well an
estimate can be made from such a test about the coverage of the domain.
Thus one cannot think of precise measurement in a close domain as one can in
a single act domain.
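
The inference from a sample to a close domain can be sketched as follows. The domain of 30 formulae and the sample of 10 follow the example above; the pupil's responses are hypothetical, the function names are illustrative, and the standard error uses the usual finite-population approximation for a proportion rather than any procedure prescribed in this paper.

```python
import math
import random


def draw_sample(domain_items, n, seed=None):
    """Draw n distinct items at random from the listed domain."""
    return random.Random(seed).sample(list(domain_items), n)


def estimate_domain_score(responses, domain_size):
    """Estimate the proportion of the domain mastered, with an approximate
    standard error that shrinks to zero when the whole domain is tested."""
    n = len(responses)
    p_hat = sum(responses) / n
    fpc = (domain_size - n) / (domain_size - 1)  # finite population correction
    se = math.sqrt(p_hat * (1 - p_hat) / n * fpc)
    return p_hat, se


# 10 of the 30 formulae are sampled; the pupil answers 8 of them correctly
items = draw_sample(range(1, 31), 10, seed=1)
responses = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(items, estimate_domain_score(responses, domain_size=30))
```

When all 30 formulae are tested the correction factor is zero, which mirrors the precise measurement possible in a single act domain.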


The third type is the Open Domain, in which the intended learning
outcome is highly complex because of a comparatively much larger chunk of
the segment of knowledge. It demands tests which are open in their scope
and which provide a much broader basis of sampling, which remains the
underlying construct of such tests. The concept of an open domain will be
clearer from the following example:


EXAMPLE :

"The pupil will be able to understand the concept of food chain". In
this case, understanding of this concept would mean that the pupil will be
able to discriminate between examples and non-examples of food chains. In
this type of domain we cannot limit the examples and can go on adding new
examples of different food chains. As such, a sample would never be as
representative of the domain attributes as it could be in the case of a close
domain. For determining the validity of such a domain one has to depend on
the expert opinion or judgement of the teachers. Therefore,
functional validity of a domain can be determined by finding out the degree
to which a C.R.T. reveals the intended function (I.L.Os), verifiable by
empirical techniques.


7. TO CONCLUDE 


In the previous pages an attempt is made to define, identify and
delineate the domains in a given unit of learning. A happy combination of
'brevity' and 'sufficiency' is desirable. Domain description involves the
identification of the content elements and their sequencing, besides
formulating the specific intended learning outcomes corresponding to each
of the concepts or content elements listed. Once a domain is properly
defined it becomes the basis for test specification in developing criterion
referenced tests. Since criterion referenced tests are to be used for
diagnosing adequacies and inadequacies in students' learning, their
description by classroom teachers is warranted. Internal review by the
developers followed by external review by other teachers on the basis of
consensus is the usual approach for validating a domain. It hardly needs
mention that the task in the case of a 'Single Act Domain' or 'Close Domain' is much
easier than in defining an 'Open Domain', which demands much more insight
into the sampling procedures as well as the testing specifications. The more
precisely, unambiguously and discretely a domain is defined in terms of the
nature of its content elements, arranged in hierarchical order along with the
corresponding intended learning outcomes, the easier it becomes for
the test developers to formulate test specifications for validating the domain
and developing criterion referenced tests to find out inadequacies and
adequacies in students' learning.


Annexure A


Definition of Domain:
Germination of seeds


Concept                        Intended learning            Questions
                               outcomes
                               The pupils

C1. The seed grows into its    1.1 explain the term         1. When a seed begins to grow
    own kind of plant; this        germination of seed.        it is called
    is called germination of                                   1. Plantation
    seeds.                                                     2. Transplantation
                                                               3. Germination
                                                               4. Gardening

C2. Seeds require water, air   2.1 recall the requirements  2. For germination, seeds
    and temperature for            for germination.            need
    germination.                                               1. only water.
                                                               2. only air.
                                                               3. only warmth.
                                                               4. all of the above.

C3. The germinated seeds are   3.1 recall the term          3. Germinated seeds are
    called seedlings. A            seedling.                   called
    seedling is a new plant.                                   1. Plant
                                                               2. Seedlings
                                                               3. Shrub
                                                               4. Fruit

                               3.2 recall that soil is      4. Which of the following
                                   necessary for growth        things is necessary for
                                   of the seedling.            the growth of a seedling?
                                                               1. Soil
                                                               2. Light
                                                               3. Pot
                                                               4. Shade

                               3.3 identify that the root   5. Which of the following
                                   part appears first in       appears first in the
                                   germination.                germination of a seed?
                                                               1. Stem
                                                               2. Root
                                                               3. Leaf
                                                               4. Flower

                               3.4 identify that the stem   6. In the germination of a
                                   portion appears later       seed the stem appears
                                   in germination.             1. before root.
                                                               2. after root.
                                                               3. with root.
                                                               4. after leaf.

C4. Two methods are used to    4.1 give examples of seeds   7. Which of the following
    grow new plants.               which become plants in      seeds become plants in
    (a) When the seed becomes      the same place as sown.     the place as sown?
    a plant in the same place                                  1. Wheat and Maize
    as sown.                                                   2. Tomato and Brinjal
    (b) When the seedling,                                     3. Wheat and Sugarcane
    after a few cm. of height,                                 4. Paddy and Bajra
    is transferred from its
    sown place to another      4.2 give examples of such    8. Paddy and Brinjal are
    place for further growth.      seeds in which              similar as they are
                                   seedlings are               1. transferred to the
                                   transferred from the           field after seedlings
                                   sown place for further         are grown.
                                   growth.                     2. not transferred to the
                                                                  field after the
                                                                  seedlings are grown.
                                                               3. grown by the same
                                                                  method as wheat.
                                                               4. planted and grown in
                                                                  small pots.


Item Writing Techniques


Navnit S.Rathod 
ABSTRACT 


This chapter intends to introduce the four most important item
writing techniques (item form, mapping sentence, concept
analysis and structural approach). They can be applied to define
the content domain and generate the universe of items. Item
writing techniques play a pivotal role in criterion-referenced tests.
Illustrations of all the item writing techniques are provided from the
subject of mathematics and presented in tabular form.


1. INTRODUCTION 
A criterion-referenced test is a part of systematic instruction. We are
interested in the domain score, rather than the test (raw) score. The domain score
can be obtained by using generalizability theory. Random sampling of
test items from the universe of items is the basis of generalizability
theory. The universe of items can be generated by applying an appropriate item
writing technique. Item writing is a scientific and creative procedure. Item
writing techniques provide rules for constructing test items. Item writing
techniques are a direct extension of the criterion-referenced test movement.


‘Item Form' and ‘Mapping Sentence’ are used to generate the 
universe of items by a small computer. These techniques can be 
programmed for a computer and used in the computer based instruction. 


Item writing techniques are reviewed by Bejar (1983, pp. 9-17),
Herman (1985, pp. 2748-2753) and Roid (1984, pp. 49-77). Six item writing
techniques are compared by Berk (1980). A new item writing technique, the
"Structural Approach", was devised by Scandura (1977). Readers are
requested to refer to Roid and Haladyna (1982, pp. 91-200) for complete
details of item writing techniques.


This chapter provides an introduction to item writing techniques.
Item form, mapping sentence, concept analysis and structural approach are
included. Illustrations are given from the subject of mathematics.


2.0 ITEM FORM

The item form was invented by Hively. The item form is most appropriate to
factual, scientific and quantitative areas. It is an important item writing
technique. Hively's item form is highly formalized. It includes a fixed
syntactical structure. It consists of a general description, stimulus and
response characteristics, cell matrix, item form shell,
replacement scheme, replacement sets and scoring specifications. It has
one or more dimensions. Each variable element has a defined replacement
set. It provides an exhaustive set of rules for generating a set of related items.


An example of the item form is given in Table 1. Each of the
elements in this item form will be explained.


2.1 General description 

First is a general description of the item form. The student has to add
two one-digit integers (positive, negative or zero). The student may use the
given number line, if necessary. The student has to write the correct answer with
the proper sign in the blank provided in the test paper.


2.2 Stimulus and Response Characteristics 

Stimulus and response characteristics show the content limits. For this
item form, the integers are one-digit (positive, negative or zero). The student is
expected to write the correct answer, a one- or two-digit integer with the
appropriate sign, in the blank.


2.3 Cell Matrix 

The cell matrix may have one or many dimensions. This cell matrix
has two dimensions: 'a' for the first integer and 'b' for the second integer. The
first or second integer is positive, negative or zero. In this way, there are nine cells
(3 x 3) in this matrix.


2.4 Item form shell
A set of instructions for administering the items, the sample item and the list
of materials are included in the item form shell.


2.5 Replacement Scheme and replacement sets 

The exact and detailed mechanics of generating item pools for the given
domain are provided by the replacement scheme and replacement sets. There
are three alternatives (positive, negative or zero) for the first integer, and the
same is the case for the second integer. So there are nine (3 x 3 = 9) cells in this
replacement scheme. In this case, the replacement sets are different one-digit
integers. For example, for cell one, they are the positive integers from one
to nine.


2.6 Scoring specifications
Instructions are provided to distinguish between correct and
incorrect responses.
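
The replacement scheme lends itself to a short program. The sketch below enumerates the nine cells of the item form for adding two one-digit integers and builds every item with its scoring key; the cell numbering follows the replacement scheme of Table 1, while the function names and the printed layout of an item are illustrative assumptions.

```python
from itertools import product

# Replacement sets for a one-digit integer of each kind
REPLACEMENT_SETS = {
    "positive": range(1, 10),
    "zero": range(0, 1),
    "negative": range(-9, 0),
}
KINDS = ["positive", "zero", "negative"]


def show(n):
    """Write a non-zero integer with its sign in brackets, zero without a sign."""
    return f"({n:+d})" if n != 0 else "0"


def generate_universe():
    """Return {cell number: [(item text, correct answer), ...]} for all 9 cells."""
    universe = {}
    cell = 0
    for b_kind in KINDS:        # second integer changes every three cells
        for a_kind in KINDS:    # first integer changes from cell to cell
            cell += 1
            universe[cell] = [
                (f"{show(a)} + {show(b)} = ____", a + b)
                for a, b in product(REPLACEMENT_SETS[a_kind],
                                    REPLACEMENT_SETS[b_kind])
            ]
    return universe


universe = generate_universe()
print([len(items) for items in universe.values()])
print(sum(len(items) for items in universe.values()))
```

A test form can then be assembled by drawing items at random from each cell, so that every form samples the same domain.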


3. THE MAPPING SENTENCE (FACET DESIGN) 

Guttman proposed the application of the facet design to educational
measurement. The mapping sentence is the device for creating the universe of
items. Each mapping sentence consists of fixed and variable parts. Each of
the variable parts is called a facet. The variable elements are called facet
elements. There is more than one element in each facet. The facet
design specifies the domain, and it constitutes a set of mapping sentences
with all facets and facet elements.


An example of the application of facet design is presented in Table
2. Table 2 shows a mapping sentence with three facets that have three, two
and four facet elements. One could compose twenty-four (3 x 2 x 4 = 24)
different sentences from the mapping sentence shown in Table 2, some of
which would be true and some false. Two examples of
true-false items are shown in Table 2 as sample items.


By holding the last two facets constant and letting the first facet
provide the correct answer and two foils, the format could be changed to
multiple choice. Twenty six multiple choice items are possible from these
mapping sentences. Stems and foils are created systematically.


If we remove the facet elements one at a time from the 24 mapping
sentences, seventy-two (24 x 3 = 72) completion items could be generated.


A computer can be programmed to generate the universe of items by
using the facet design.
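
A minimal sketch of such a program is given below. The fixed part of the sentence and the facet elements 'opposite', 'adjacent', 'all', 'sides', 'angles', 'parallelogram' and 'rectangle' are taken from Table 2 and its sample items; 'rhombus' and 'square' are assumed here only to make up the four elements of the third facet, and the function names are hypothetical.

```python
from itertools import product

FACET_1 = ["opposite", "adjacent", "all"]
FACET_2 = ["sides", "angles"]
# 'rhombus' and 'square' are assumptions made for this illustration
FACET_3 = ["parallelogram", "rectangle", "rhombus", "square"]
TEMPLATE = "The {} {} of the {} are congruent."


def mapping_sentences():
    """All statements obtainable from the mapping sentence (true/false items)."""
    return [TEMPLATE.format(*combo)
            for combo in product(FACET_1, FACET_2, FACET_3)]


def completion_items():
    """Blank out one facet element at a time; return (item text, answer) pairs."""
    items = []
    for combo in product(FACET_1, FACET_2, FACET_3):
        for position in range(len(combo)):
            parts = list(combo)
            answer = parts[position]
            parts[position] = "____"
            items.append((TEMPLATE.format(*parts), answer))
    return items


print(len(mapping_sentences()), len(completion_items()))
print(mapping_sentences()[0])
```

The key for each true/false statement still has to be supplied by the subject expert, since the program only generates the statements and does not judge their truth.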


4.0 CONCEPT ANALYSIS 


Tiemann and Markle developed the concept analysis technique.
Concept analysis is the most complete and systematic approach to the
teaching and testing of concepts. They stressed the point that full
understanding of a concept cannot be taught or tested by a single item. It is
not possible to generalize from one example. The student needs at least one
example and one non-example of the concept to discriminate.


4.1 Critical Attributes 


The critical and variable attributes of the particular concept are
systematically analysed and listed. Attributes common to all members of the
class are called critical attributes. Absence of one or more critical attributes
provides a non-example of the particular concept. Close-in non-examples are
useful to test discrimination.


4.2 Variable Attributes 

Variable attributes may differ among examples of a particular
concept. Systematic rotation of variable attributes generates examples. We
can prepare a list of examples that differ in their variable attributes.


4.3 List of examples and non-examples 

At the end, four types of lists must be prepared: teaching examples,
teaching non-examples, testing examples and testing non-examples. These
lists must be used during teaching and testing. Separate lists are necessary
for teaching and testing.


An illustration of "Concept Analysis" of "set" is shown in Table 3. 
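
The rotation of variable attributes can be sketched as below, using the three variable attributes of the concept 'set' listed in Table 3 (numbered 2, 3 and 4, each with levels a and b). The function name is hypothetical, and the sketch only enumerates attribute profiles; writing an actual example for each profile remains the item writer's task.

```python
from itertools import product

# Variable attributes of the concept 'set', numbered as in Table 3
VARIABLE_ATTRIBUTES = {
    2: {"a": "list method", "b": "property method"},
    3: {"a": "finite", "b": "infinite"},
    4: {"a": "concrete", "b": "abstract"},
}


def rotate_variable_attributes(attributes):
    """Return every combination of levels, e.g. '2a, 3a, 4b'."""
    numbers = sorted(attributes)
    level_sets = [sorted(attributes[n]) for n in numbers]
    return [", ".join(f"{n}{level}" for n, level in zip(numbers, levels))
            for levels in product(*level_sets)]


for profile in rotate_variable_attributes(VARIABLE_ATTRIBUTES):
    print(profile)
```

Not every profile need yield a usable example (an infinite set cannot be fully written out by the list method), so the writer selects from the generated profiles.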


5.0 STRUCTURAL APPROACH 

Scandura proposed an algorithm which links instruction and
measurement. The process of solving the problem must be well defined.
From this algorithm, we get the rules for generating items. An illustration of
the structural approach is given in Table 4. The diagram shows the teacher
what to teach and the examiner what to test. By random selection of the three
constants a, b and c of the quadratic equation (ax² + bx + c = 0), we can
generate the universe of items.
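
A minimal sketch of this idea follows. It draws the constants a, b and c at random, computes the discriminant b² - 4ac, and classifies and solves each equation along the three paths of the algorithm. The range of -9 to 9 for the constants and the function names are assumptions made for illustration.

```python
import cmath
import random


def solve_quadratic(a, b, c):
    """Classify and find the roots of ax^2 + bx + c = 0 (a must not be zero)."""
    discriminant = b * b - 4 * a * c
    if discriminant > 0:
        kind = "real and distinct"
    elif discriminant == 0:
        kind = "real and equal"
    else:
        kind = "complex"
    root = cmath.sqrt(discriminant)  # works for negative discriminants too
    return kind, ((-b + root) / (2 * a), (-b - root) / (2 * a))


def generate_item(rng, low=-9, high=9):
    """Draw a, b, c at random (a non-zero) and return the item with its key."""
    a = rng.choice([n for n in range(low, high + 1) if n != 0])
    b = rng.randint(low, high)
    c = rng.randint(low, high)
    return (a, b, c), solve_quadratic(a, b, c)


rng = random.Random(0)
for _ in range(3):
    print(generate_item(rng))
```

Because each path of the algorithm corresponds to one kind of root, items can be sampled so that every path is both taught and tested.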


Table 1 
An example of a formalized item form 


Item form 1 


Addition of two, one-digit integers (positive, negative or zero) with 
the help of a number line. 


GENERAL DESCRIPTION
Each one-digit integer will be negative, zero or positive. The pupil is
expected to add two one-digit integers with the help of a number line and
write the answer, with the appropriate sign, in the blank given on
the right-hand side.
STIMULUS AND RESPONSE CHARACTERISTICS
Stimulus Characteristics
(i) Two integers having one digit each.
(ii) Each one-digit integer may be a negative integer, a positive integer or
zero.
(iii) The stimulus will be presented in printed form.


Response Characteristics
(i) One- or two-digit integer.
(ii) With the appropriate + or - sign, or without a sign when the response is zero.


CELL MATRIX

                                b, second integer
                         Positive     Zero     Negative
a, first    Positive       (1)        (4)        (7)
integer     Zero           (2)        (5)        (8)
            Negative       (3)        (6)        (9)


ITEM FORM SHELL

Materials
Printed test papers
Printed number line
Pen or pencil

Directions to examiner
Distribute the test papers among the students.

Script
Add the two integers.
You can use the number line.
Write the answer in the blank.

Sample item: (+5) + (-3) = ____


Replacement Scheme and Replacement Sets


Replacement   Replacement sets                                      Items
scheme        (choose a and b randomly)                             per cell
Cell 1        {a | 0 < a ≤ 9, a ∈ Z}     {b | 0 < b ≤ 9, b ∈ Z}      81
Cell 2        {a | a = 0}                {b | 0 < b ≤ 9, b ∈ Z}      09
Cell 3        {a | -9 ≤ a < 0, a ∈ Z}    {b | 0 < b ≤ 9, b ∈ Z}      81
Cell 4        {a | 0 < a ≤ 9, a ∈ Z}     {b | b = 0}                 09
Cell 5        {a | a = 0}                {b | b = 0}                 01
Cell 6        {a | -9 ≤ a < 0, a ∈ Z}    {b | b = 0}                 09
Cell 7        {a | 0 < a ≤ 9, a ∈ Z}     {b | -9 ≤ b < 0, b ∈ Z}     81
Cell 8        {a | a = 0}                {b | -9 ≤ b < 0, b ∈ Z}     09
Cell 9        {a | -9 ≤ a < 0, a ∈ Z}    {b | -9 ≤ b < 0, b ∈ Z}     81


Total number of items in the universe = 361


SCORING SPECIFICATIONS 

A correct answer with the appropriate sign (plus or minus, if
necessary), written in the blank provided in the test paper, is
scored as correct; any other response is scored as incorrect.


Table 2
A Mapping Sentence

Subject   : Mathematics, Geometry
Unit      : Parallelogram and its types
Objective : A student will identify the properties of a parallelogram and
            its types, and mark the statements with the appropriate sign,
            right (✓) or wrong (X), in the blank.

Sentence
          Facet 1         Facet 2                  Facet 3
The       1. Opposite     1. sides      of the     2. Rectangle     are congruent.
          2. All          2. angles

Sample Item
Mark each statement with a right (✓) or wrong (X) sign in the blank.

__ 1. The opposite sides of the parallelogram are congruent.
__ 2. The adjacent sides of the parallelogram are congruent.

Universe of items
Facet 1 (3) x Facet 2 (2) x Facet 3 (3)
= 24 True/false items




Table 3
An Example of Concept Analysis

Subject : Mathematics : Algebra
Concept : Set

Critical Attribute
1. From the description of a set, it must be possible to ascertain
   whether a specific object is an element of the set.

Variable Attributes
2. Method of describing a set
   (a) The list method    (b) The property method
3. Number of elements in a set
   (a) Finite    (b) Infinite
4. Types of the elements
   (a) Concrete    (b) Abstract

Testing examples (Generalization)
(Rotation of variable attributes)
2a, 3a, 4a    A = {Ramesh, Mahesh}
2a, 3a, 4b    B = {1, 2, 3}
2b, 3a, 4a    C = {x | x = people living on the earth}
2b, 3a, 4b    D = {x | x < 10, x ∈ N}
2b, 3b, 4b    E = {x | x = natural number}

Testing non-examples (Discrimination)
(Lack of critical attribute)
(1) Two tasty dishes.
(2) The three best English dishes.
(3) The best student in the class.


[Flowchart: roots of the quadratic equation ax² + bx + c = 0. Calculate
Δ = b² - 4ac; depending on Δ, the roots are real and distinct, real and
equal (α = β), or complex (p + iq, p - iq).]
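
The decision structure of the flowchart can be sketched in code. This is an illustrative rendering, not the author's own material: the function name is hypothetical, and the formula for the roots is the standard quadratic formula rather than something legible in the chart itself.

```python
import cmath
import math

def classify_roots(a, b, c):
    """Classify and compute the roots of a*x**2 + b*x + c = 0 (a must be non-zero)."""
    if a == 0:
        raise ValueError("not a quadratic equation: a must be non-zero")
    delta = b * b - 4 * a * c  # the discriminant
    if delta > 0:
        r = math.sqrt(delta)
        return "real and distinct", ((-b + r) / (2 * a), (-b - r) / (2 * a))
    if delta == 0:
        root = -b / (2 * a)
        return "real and equal", (root, root)
    r = cmath.sqrt(delta)  # purely imaginary when delta < 0
    return "complex", ((-b + r) / (2 * a), (-b - r) / (2 * a))

print(classify_roots(1, -3, 2))
print(classify_roots(1, 2, 1))
print(classify_roots(1, 0, 1))
```

Each branch corresponds to one terminal box of the chart, so a test item can be written for every branch.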
SECTION II
JUDGING TEST QUALITIES


Discussion about Some of the
Fundamental Issues Pertaining to
Validity of Criterion-Referenced
Achievement Tests


A.S. Dhaliwal 
ABSTRACT 


This paper attempts to discuss some fundamental issues
relating to measurement. The first fundamental issue refers to the
additivity of marks based on different questions or items included
in an achievement test. In this connection, whether it is valid to add
marks pertaining to different questions in an achievement test is
discussed. The place of absolute zero and the condition of equal-appearing
intervals, as they bear on measurement, are explained. The
second fundamental issue, pertaining to the level of measurement
related to the area of achievement testing, is discussed. While
discussing the validity of criterion-referenced results, according to
the author, an achievement test is said to be valid when it
measures what it purports to measure, but operationally a valid
achievement test ought to cover all independently examinable units
of knowledge pertaining to a prescribed course. Concepts of
criterion-referenced measurement as given by various authors are
critically examined, and reference is made to the
three kinds of criterion-referenced validation strategies, that is,
descriptive, functional and domain-selection validity. While concluding,
the author prefers to substitute the term ‘content-referenced measure’
for the term ‘criterion-referenced’.


The crucial observation on which the thesis presented in this paper 
is founded lies in seeing the point that ‘Achievement Testing’ could not 
develop into a Science because, in this area of human knowledge, some of 
the basic and fundamental issues still remain unresolved. Whether it is 
Criterion-Referenced Measurement, or it is Norm-Referenced Measurement, 
or it is Content-Referenced Measurement, the contention of this paper is 
that without properly thrashing out such fundamental issues pertaining to
measurement of differences in acquisition of knowledge regarding
prescribed courses, it is futile to talk about validity of scores based on any 
kind of achievement test. Some of such fundamental issues are discussed 
in the subsequent paragraphs. 


The first fundamental issue is whether it is valid to add marks 
pertaining to different questions included in an achievement test. Though 
the tradition which permits addition of marks based on different questions, or 
items included in an achievement test is very deeply rooted, yet we muster 
up courage to profess that it is invalid to do so. Mathematically, hence 
logically, the term additivity is very, very broad-based and technical. It is 
broad-based in the sense that it covers all the five basic arithmetical 
processes, namely addition, multiplication, subtraction, division and ratio; 
and the technicality related to this term may be realised from the fact that the
ramifications of its use remain, most often, misunderstood, even in academic
circles. 


In order to explain the broad-basedness of additivity as a technical
term, it may be realised that multiplication is graded addition; subtraction is
just addition after attaching a minus sign to the subtrahend; division is graded
subtraction; and when two numerical values are put to ratio, then also
division is implied. Now, reversing the order, if ratio is division, and division
is graded subtraction, and subtraction is addition on attaching a minus sign to
the subtrahend, and multiplication is another name for graded addition, then
it gets implied that there is, virtually, no basic difference in the processes of
addition and ratio. Now, by implication, we are justified in generalising that only
those measures may be added in case of which we may compute ratio.


Mathematically, ratio may be computed from those measures which 
are related to the ratio scale. By definition, the ratio scale is supposed to 
fulfil the following two fundamental conditions : first, it starts with an
absolute zero which, by definition, indexes non-existence of the attribute
to be measured; second, it fulfils the condition of
equal-appearing-intervals.


Here, it hardly needs to be emphasized that both of these conditions 
are very, very rigorous, in the sense that howsoever sophisticatedly we may 
design a test for measuring differences in academic attainments, it will not 
be possible to fulfil these two pre-requisites. The truth is that not only
excellence but also ignorance regarding academic attainments tends to
remain unfathomable. A cipher in the distribution of achievement scores, if
any, is just a phenomenal zero, and it is invalid to regard such a zero as a true
zero, which is indicative of non-existence of the attribute represented by it.
Actually, the magnitude of knowledge under such a phenomenal zero may be
as large as the Atlantic Ocean; nothing can be said with confidence
about the true depth of knowledge indexed by such a zero. Nevertheless,
there is no justification for taking such a zero as the index of non-existence
of knowledge about the prescribed courses. If any test represents
sampled behaviour, then on what grounds are we justified in assuming that
a true zero is approachable in the measurement of differences in academic
attainments? There is no answer. However, only this much may be propounded: that
once the universe of knowledge in any area is circumscribed or prescribed, 
then we lose the prerogative to draw a sample. 
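
A small numerical sketch may make the point about the phenomenal zero concrete. It illustrates only the narrower claim that a ratio of scores depends on where the true zero lies; the scores, the amounts of "hidden" knowledge and the function names are all hypothetical.

```python
def ratio_of_true_amounts(score_x, score_y, hidden):
    """Ratio y/x if `hidden` units of knowledge lie below the test's zero."""
    return (score_y + hidden) / (score_x + hidden)

def difference_of_true_amounts(score_x, score_y, hidden):
    """Difference y - x under the same assumption; the hidden amount cancels."""
    return (score_y + hidden) - (score_x + hidden)

# Two examinees score 20 and 40 marks on the same test.
for hidden in (0, 20, 100):
    print(hidden,
          ratio_of_true_amounts(20, 40, hidden),
          difference_of_true_amounts(20, 40, hidden))
```

Only when nothing lies below the observed zero is the second examinee "twice as good" as the first; for any other depth the ratio changes, although the difference does not.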


So far as fulfilment of the condition of equal-appearing-intervals in
the ratio scale is concerned, one thing is crystal clear: the
ratio scale is invariably a single-unit scale, and all the other units, or
intervals, in the scale are only extensions of that single unit. The purpose
of extending the scale by adding identical units is to facilitate the process
of measurement; otherwise, the final results of any two simultaneous
measurements, one reached with the help of a single-unit scale and the other
with the help of a multi-unit scale, are to be perfectly similar.


Here, one more thing needs to be pointed out: in order
to ensure perfect objectivity in the measures obtained with the help of a
single-unit or a multi-unit scale, perfect agreement among experts over the
basic unit of measurement becomes obligatory. For example, the ‘FOOT’ in the
FOOT-POUND-SECOND (F.P.S.) system and the ‘METER’ in the M.K.S. system
are deposited as the universally agreed units of measurement in the
museums of their respective cultures, to ensure permanence and
objectivity.


Keeping in view what has been said in the preceding paragraphs,
the genuineness of the traditionally popular generalization, which stresses that
on lengthening an achievement test the reliability of the scores based on it goes up,
becomes dubious. Some scientists have come to doubt the validity of this
kind of generalization. In this context, the observation made by Kiesler et al
(1969, p. 17) is in order :

With respect to reliability, the TAYLOR and PARKER results 
would also seem to contradict the widely accepted mathematical 
postulate that the reliability of the scale is increased with the
number of items in the scale. It is possible that the single-question 
technique gains as much reliability with its directness of 
questioning, which bypasses the belief structure, as it loses by 
being only one item. 
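
The "widely accepted mathematical postulate" referred to in the quotation is usually formalised as the Spearman-Brown formula; the source does not name it, so that identification is an assumption. A minimal sketch of the formula being questioned, with a hypothetical function name:

```python
def spearman_brown(reliability, k):
    """Predicted reliability of a test lengthened k times with parallel items."""
    return k * reliability / (1 + (k - 1) * reliability)

# A test of reliability 0.50, lengthened 1, 2, 4 and 8 times.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.50, k), 3))
```

The formula presupposes that item scores can be summed into a total score, which is exactly the assumption the present paper disputes.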

Now, by implication, we are justified in saying that if scores based on
items included in a test are not additive, then it is also reasonable to infer
that achievement scores pertaining to different subjects of study are also not
additive.


At this stage, another point at issue is that once it is accepted that
scores based on different items included in an achievement test are not
additive, and that for extension of a scientifically sound scale it is obligatory
to have a universally agreed unit of measurement, all sorts of mathematical
operations involved in carrying out various types of item-analyses, and those
used in computation of 'item-difficulty-indexes' and
'item-discrimination-indexes', will turn out to be exercises in futility.


The second fundamental issue which demands thorough probing
pertains to the level of measurement with which we are concerned in the area
of Achievement Testing. Quite a few philosopher-scientists (Steven,
cited in Boring et al, 1963, p. 258; Stilson, 1966, p. 137; Garner and
Creelman cited by Helson et al, 1969, p. 4; McNemar, 1969, pp 430-31;
Simkins, 1969, p. 145; Moskowitz et al, 1969, p. 214; Munn, 1969, pp 54-82;
Kagan and Haveman, 1972, p. 438; Willemsen, 1974, p. 182; Guilford, 1975,
p. 15; and Marx, 1976, p. 574) have started professing that measures of
differences in human behaviour, which are directly contingent upon
differences in psychic qualities, are amenable to the Ordinal scale. The
contention of the present thesis against this kind of assertion is that the
question of transforming summated scores into ordinal ranks gets scuttled
by the objection raised against the tradition which permits addition of
scores based on different items included in a test. Obviously, if addition of
scores related to different items is not logically plausible, then on what
grounds are we justified in using summated scores for deriving rank-orders?
There is no answer. However, in this context, the following two observations
seem to be very highly significant:

First, that all the vital units of knowledge in a prescribed course are 
independent, in the sense that they are to be grasped or learned separately; 
and that on learning one such unit of knowledge, it is not possible to, 
automatically, learn other units of knowledge; and that independent and 
separate efforts are to be made for grasping all such units.


Second, in learning such independent units of knowledge,
individuals show qualitative, and not quantifiable, differences, meaning
thereby that we can indicate individual differences in learning those
independent units of knowledge by using this or that type of qualifying
terms, such as ‘good’, ‘better’, ‘best’, etc., or simple letters, namely ‘A’, ‘B’,
‘C’, ‘D’, ‘E’, etc., as grades in case of essay answers, and right and wrong (‘R’ or
‘W’) in case of objective-type tests.

A bit of concentration on these two foregoing observations will reveal
that in Achievement Testing measurement of differences does not rise
above the level of the NOMINAL Scale. Though the Nominal scale is
considered to be the crudest scale, yet we muster up courage to state that
the conditions of this scale are also very rigorous. The Nominal scale
demands that the information regarding classification should be exhaustive,
in the sense that it should cover information pertaining to all the situations
of observation, and that the object to be classified must fall exclusively into this or
that category. In this context, the observation of HAYS (1969, p. 7) is as follows :

‘When a scheme of classification requires that each observation
must go into only one category out of possible categories, and that
every observation must go into some category, the set of
categories is called a nominal scale. The categories making up
a nominal scale are said to be mutually exclusive (meaning thereby
that each observation must be placed into only one category) and
exhaustive (that is, every observation must be placed into some
category).’

Before taking up the question as to in what way the implied truths 
underlying use of the Nominal Scale will be helpful in working out 
academically meaningful criterion for classification and categorization of 
individuals, it seems advisable to thrash out one more fundamental issue. 
That is, how far is it justified to use the concepts ‘MEAN’ (or average),
‘STANDARD DEVIATION’ and ‘VARIANCE’, etc. for describing results obtained from
norm-referenced achievement tests? The truth is that all these statistical
concepts are bound up with the theory which seeks inspiration from two
assumptions : (i) that achievement test scores pertaining to a group of
individuals are potent enough to fulfil the conditions of
NORMAL DISTRIBUTION; and (ii) that achievement test scores
represent quantitative phenomena. After having accepted the truths that
all the units of knowledge in a prescribed course are independent and
complete in themselves, and that such units represent sheerly
discrete phenomena which are subject only to qualitative analysis, all the
concepts, namely mean, S.D., etc., should become anathema. In
this context, McNEMAR (1969, p. 431) is very sarcastic. He professes that :

"It is easy to see that adding numbers, used to code qualitatively 
different categories, in order to compute a mean, is non-sensical. 
It is easy to see that ratio scale is required for a meaningful 
coefficient of variation. It is easy to see that rank positions as
scores will lead to absurd standard deviation."

Actually, mean as a concept when applied to differentiate human 
beings becomes very 'mean' and it fails to serve any useful purpose. The 
tradition which allows the mean score to be used as the NORM for comparing
performances of individuals taking an achievement test is downright
misleading. Logically, it seems plausible to assert that the CRITERION, or
the Norm or the Standard or the Touch-stone (whatever term is used, it
hardly matters) for giving a judgement about one's adequate or inadequate
preparation of the prescribed courses must be worked out from the
circumscribed universes of the courses themselves, and the results must be
Content-Referenced.


HOW TO MAKE THE CRITERION-REFERENCED RESULTS VALID 


Though the discussion about certain fundamental issues contained
in the preceding pages alludes to crucial and revolutionary changes in
the area of Achievement Testing, yet theoretically the connotation, denotation,
intension and extension of the term ‘validity’ would remain unchanged.
Theoretically, an achievement test is said to be VALID when it measures what
it purports to measure. However, operationally, a valid achievement test
ought to cover all the independently examinable units of knowledge
pertaining to a prescribed course. The question-banks comprising all the
independently examinable units covered in different subjects of the curriculum
may be prepared by boards of experts. But once such
exhaustive question-banks, covering the whole of the universes in their
respective places, are available, then, as is implied in the definition of the
Nominal Scale, the examinees must be required to write answers to all the
questions, and no sample is to be drawn.


The essay answers may be evaluated by using five
category-responses, i.e. A, B, C, D and E, as letter-grades, and objective-type
answers may be adjudged as ‘R’ and ‘W’.


Accepting that examination marks based on different questions 
included in achievement tests, and those pertaining to different subjects of 
curriculum are not additive and that examination marks are representing 
discrete phenomena, and not the differences on Continuum-Variables, it 
becomes obligatory to treat examination marks as frequency data from
which meaningful information may be derived only through counting of
letter-grades and right answers to objective-type questions. For example, if
there are 50 propositions and 50 exercises selected as independently
examinable units from Demonstrative Geometry prescribed for the
Matriculation Examination, and a particular student earns ‘A’ grade 100 per
cent of the time, then the certificate based on these data will read as follows :

‘Certified, with 100 per cent confidence, that Mr. so and so knows 
all the propositions and exercises included in the question-bank 
details of which are available in the personal profile and whenever 
he attempts any question, he obtains ‘A’ grade. Secondly, out of 
500 objective-type questions, he could give 400 correct answers.’ 
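
Treating examination marks as frequency data amounts to counting, not adding. The sketch below tallies letter-grades and right answers in the manner of the certificate above; the function name and the data are hypothetical, chosen to match the figures in the example.

```python
from collections import Counter

def content_referenced_summary(essay_grades, objective_marks):
    """Count letter-grades (A to E) and right answers (R) without adding marks."""
    grade_counts = Counter(essay_grades)
    n_units = len(essay_grades)
    percent_by_grade = {g: 100 * grade_counts[g] / n_units for g in "ABCDE"}
    n_right = sum(1 for mark in objective_marks if mark == "R")
    return percent_by_grade, n_right, len(objective_marks)

# 50 propositions and 50 exercises, all graded 'A'; 500 objective-type questions.
essay_grades = ["A"] * 100
objective_marks = ["R"] * 400 + ["W"] * 100

percent_by_grade, n_right, n_objective = content_referenced_summary(
    essay_grades, objective_marks)
print(percent_by_grade)
print(f"{n_right} correct answers out of {n_objective} objective-type questions")
```

The summary reports frequencies per category only; no total or mean is formed, in keeping with the Nominal Scale argument.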

It seems advisable to keep such certificates separate for different
subjects of the curriculum. This recommendation is bound up with the theory
which stresses that an examinee, as a living human being, is fully justified to
work in accordance with his interests and aptitudes.
From a perusal of the descriptions of the procedures to be adopted
for preparation of the ‘Criterion-Referenced Achievement Test’, it may be
realised how adequately and precisely we remained within the framework
emanating from the theoretical definitions of the concept
‘Criterion-referenced measurement’. With a view to supporting our statement,
some such definitions are reproduced below :

(i) Iven (1970, p. 2) simply defines a criterion-referenced test as one
comprised of items keyed to a set of behavioural objectives;

(ii) Harris and Stewart (1971, p. 1) profess that ‘a pure
criterion-referenced test is one consisting of a sample of
production tasks drawn from a well-defined population of
performances, a sample that may be used to estimate the
proportion at which the student can succeed.’

(iii) Glaser and Nitko (1971) define criterion-referenced tests as those
deliberately constructed so as to yield scores directly interpretable
in terms of specified performance standards.

(iv) Popham (1978, p. 2) says a criterion-referenced test is designed to
produce a clear description of what an examinee’s performance
on the test actually means.

With the exception of the definition given by Harris and Stewart
(1971, p. 1), the detailed descriptions of the techniques given in our
discussion about designing of criterion-referenced tests seem to have
very nicely followed the tracks emerging from the above given definitions.
We have stressed coverage of the courses, and our conviction is that
drawing a sample from the circumscribed course is not advisable; nay, that is
to play with the content-validity. Obviously, when validity goes, then both
objectivity and reliability of the test scores become dubious. A
criterion-referenced test is designed to yield a clearer picture of what an
examinee's performance means, regardless of how that performance
compares to the performance of others. The thrust of the emerging
criterion-referenced measurement technology, therefore, is on increasing
the capabilities of criterion-referenced tests to produce lucid descriptions of
an examinee's performance.


It may be realised that the detailed discussion given in this paper
refers to all the three kinds of criterion-referenced validation strategies : that
is, ‘DESCRIPTIVE VALIDITY’, ‘FUNCTIONAL VALIDITY’ and
‘DOMAIN-SELECTION VALIDITY’.


Towards the end of our discussion, it may be realised that the term
‘Criterion-Referenced-Measurement’ is as ambiguous as the term
‘Norm-Referenced-Measurement’ is. Actually, the terms ‘criterion’ and
‘norm’ are very much overlapping, according to their dictionary meanings.
So, it seems advisable to use the term Content-Referenced-Results instead
of the term Criterion-Referenced-Measurement. It appears that in the hoary
past, in Indian universities and pathshalas, there was a tradition to make
examination results content-referenced. Who knows, ‘Chaturvedi’ may have been
the person who knew all the four Vedas, ‘Trivedi’ one knowing three Vedas,
‘Dwivedi’ one knowing two Vedas and ‘Vedi’ one knowing one Veda.
Applications of item response theory to
criterion-referenced tests


Navnit S. Rathod
ABSTRACT 


Comparison of three measurement theories is presented. 
Limitations of classical test theory and generalisability theory are 
shown. Advantages of item response theory are discussed.
Procedures of criterion-referenced test and mastery test 
construction and validation by using Rasch Model are presented. 


INTRODUCTION AND REVIEW 

The complete procedure for mastery test development and
validation is discussed by Lord (1980, Ch. 11) for the three-parameter item
response model. The limitations of the Dpp coefficient (the difference between
pre-test and post-test P values), which was suggested by Cox and Vargas, are
shown by Van der Linden (1981). He suggests using the item information
function instead of the Dpp coefficient. Hambleton and Gruijter (1983)
described the procedures of item selection for the mastery test by using the
three-parameter item response model. Comparison of one, two and


There are many test theories. Among those, three are fully
developed test theories: classical test theory, generalisability
theory and item response theory. The comparison of these theories is given
here.


2.1 Classical Test Theory
Classical test theory is the traditional and widely used test theory. It is
not free from some limitations. First, item parameters are sample bound and
thus they vary from sample to sample. Second, persons' scores are
[...]. Third, a test has not more than one [...]


2.2 Generalisability Theory
Random sampling of items from a universe of items is the basis of
generalisability theory. We can apply this theory to estimate the domain
score and to find the generalisability of the domain score. We can obtain many
important results from this theory which are not obtainable from classical
test theory. But generalisability theory is also not free from some limitations.
First, we get partial generalisability because item parameters are not sample
free. Second, it is total-score oriented, not item oriented.


2.3 Item Response Theory
Item response theory has been used with tests on which examinees [...]
each item: item difficulty parameter, discriminating parameter and guessing
parameter. Item parameters are sample free.


2.3.1 Advantages of item response theory
1. Item difficulty, item discrimination and item guessing parameters
are invariant across samples of different ability levels. This is one of the
important characteristics of item response theory.
2. The person ability estimate is independent of the particular sample of
test items.
3. Standard errors of item parameters and of the person ability estimate are
provided.
4. Addition or subtraction of items is possible, to modify the test for
different purposes.
5. Item response theory provides an effective method for item selection.
6. Item response theory is one of the most effective and fully
developed measurement theories for solving various measurement
problems.


2.3.2 Assumptions
There are three assumptions in item response theory:
unidimensionality, local independence and the item characteristic curve.


(a) Unidimensionality 

Unidimensionality is a characteristic of the test. If the test is
unidimensional, then the set of items measures one parameter. We can
check unidimensionality by factor analysis. A test has one common factor
when the item intercorrelation matrix is of unit rank. If the first factor is much larger
than any other factor, then the items are approximately unidimensional.
The test then has one common factor, which is the latent trait measured by the test.


(b) Local Independence 

For a given examinee, the probability of success on one item does not affect the
probability of success on any other item. If the probability of an examinee
answering item i correctly is Pi and for item j is Pj, then the probability of the
examinee answering both the ith and jth items correctly is Pi x Pj. These two
events are independent events. Local independence implies
unidimensionality and vice versa.
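
The product rule of local independence extends from two items to a whole response pattern. The sketch below assumes that the probabilities are those of a single examinee (that is, at a fixed ability level); the function name and the numbers are hypothetical.

```python
def pattern_probability(p_correct, responses):
    """Probability of a response pattern (1 = right, 0 = wrong) under local independence."""
    prob = 1.0
    for p, u in zip(p_correct, responses):
        prob *= p if u == 1 else (1.0 - p)
    return prob

# An examinee with Pi = 0.8 on item i and Pj = 0.5 on item j.
p = [0.8, 0.5]
print(pattern_probability(p, [1, 1]))  # both items right: Pi x Pj
print(pattern_probability(p, [1, 0]))  # item i right, item j wrong
```

Because the items are treated as independent, the probabilities of the four possible patterns on two items sum to one.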


(c) Item Characteristic Curve

The item characteristic curve is the most important concept in item
response theory. The item characteristic curve varies from one item to another.
The item characteristic curve is the regression of item score on person
ability. The item characteristic curve is discussed below under the Rasch Model.


2.3.3 Introduction to Rasch Model 


Rasch has provided a probabilistic model, which is a new approach to
solving measurement problems.


There are two types of probabilistic models, logistic and normal
ogive. The Rasch model is a logistic model. When the scaling factor D = 1.7, these two
types of models give almost the same result. The logistic model is easy to
work out mathematically, so it is often preferred.


Item response theory includes three item parameters, difficulty 
parameter, discriminating parameter and guessing parameter. When all 
items have equal discriminating parameter and guessing parameter is zero, 
the model is called Rasch Model. 


The general Rasch Model accepts the response categories 0, 1, 2, ..., m.
But we will discuss only the model which includes two response
categories, zero and one; in this case the data are dichotomous.


The person and item parameters measure the same thing and they
are expressed on the same scale. When a person is given a test item, the
outcome (pass or fail) is determined by the ability of the person and the difficulty
of the item. The odds of person x passing item i are defined by the ratio of
the person's ability $\Theta_x$ and the item difficulty $B_i$:

$$O_{xi} = \frac{\Theta_x}{B_i}$$

To convert these odds into natural logarithms, substitute
$\theta_x = \ln \Theta_x$ and $b_i = \ln B_i$, where $O_{xi} = \frac{P_{xi}}{1 - P_{xi}}$ and $P_{xi}$ is the probability
of a person x passing the item i.

$$\frac{P_{xi}}{1 - P_{xi}} = \frac{\Theta_x}{B_i} = e^{(\theta_x - b_i)}, \qquad P_{xi} = \frac{e^{(\theta_x - b_i)}}{1 + e^{(\theta_x - b_i)}}$$

which is the Rasch Model.

$\ln \left( \frac{P_{xi}}{1 - P_{xi}} \right) = (\theta_x - b_i)$ is called the logit. If we replace $\theta_x$ by
$\theta$ for any person, then the probability function $P_i(\theta)$, the probability that any
person with ability $\theta$ answers the item i correctly, is written as

$$P_i(\theta) = \frac{e^{(\theta - b_i)}}{1 + e^{(\theta - b_i)}} \quad \text{for each item } i = 1, 2, \ldots, n.$$

The person and item parameters are defined on the same scale.
Their range is from minus infinity to plus infinity. It is therefore the difference
$(\theta - b)$ which affects the probability of a correct response. The difference $(\theta - b)$
can vary from minus infinity to plus infinity, so $e^{(\theta - b)}$ can vary from zero to plus
infinity, and
$P_i(\theta) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}$ can vary from zero to one.

ITEM CHARACTERISTIC CURVE (ICC) 


The item characteristic curve is a graphic representation of $P_i(\theta)$. We get
different ICCs for different items. Figures 1 & 2 show the ICCs and corresponding
item information curves for two items.
The horizontal axis shows the ability scale $\theta$; the vertical axis shows
$P_i(\theta)$ as a function of the person's position on the ability scale $\theta$.


From Figure 1, it is clear that

Person parameter $\theta$ < item difficulty parameter $b$  implies  $P(\theta) < \frac{1}{2}$
Person parameter $\theta$ = item difficulty parameter $b$  implies  $P(\theta) = \frac{1}{2}$
Person parameter $\theta$ > item difficulty parameter $b$  implies  $P(\theta) > \frac{1}{2}$


The item difficulty parameter b corresponds to the location on the
horizontal axis at which the probability is one half. The difficulty parameter of
item 1 is $b_1 = -1.5$, and for item 2 it is $b_2 = +1.5$. Item 1 is easier than item 2.


The information function is defined as $I_i(\theta) = P_i(\theta) \times (1 - P_i(\theta))$ for the
Rasch Model. A large value of the information function is desirable, and its maximum is at
$b = \theta$ in the one-parameter model.
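
The item information function for the Rasch Model, I_i(θ) = P_i(θ)(1 - P_i(θ)), can be tabulated in a few lines. This is a sketch with hypothetical function names; the difficulty -1.5 is that of item 1 in the text.

```python
import math

def rasch_probability(theta, b):
    """P(correct) for ability theta and item difficulty b under the Rasch Model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information: P(theta) * (1 - P(theta))."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

b1 = -1.5
grid = [x / 2 for x in range(-6, 7)]  # theta from -3.0 to +3.0 in steps of 0.5
peak = max(grid, key=lambda theta: item_information(theta, b1))
print(peak, item_information(peak, b1))
```

On this grid the information of item 1 peaks where θ equals its difficulty, and the peak value is 0.25, the largest value that P(1 - P) can take.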


[Figure 1. One-parameter ICCs: $P(\theta)$ plotted against $\theta$ (from -3 to +3) for
item 1 ($b_1 = -1.5$) and item 2 ($b_2 = +1.5$).]

[Figure 2. Information functions of the items in Figure 1: $I(\theta)$ plotted against
$\theta$ (from -3 to +3) for Item-1 and Item-2.]


3. Procedures for development and validation of
criterion-referenced test by using Rasch Model.

1. Prepare a pool of valid items for a well-defined domain.

2. Administer all the items to approximately two hundred examinees.

3. Estimate the item difficulty parameters and person parameters by using the
MSCALE and MSTEPS computer programmes, which were prepared
by Wright and others.

4. Check the fit between the Rasch Model and the response data.
The BICAL computer programme is useful for calculating the analysis of fit.

5. Item selection : decide the effective range of the test on the $\theta$ scale.
The item information function is maximum at $b_i = \theta$; select the items
from the lower limit to the upper limit of the test range, such that

$$SE(b_i) + SE(b_j) < |b_i - b_j| < 2\,(SE(b_i) + SE(b_j))$$

6. Test information function (Reliability)

The concept of the test information function is analogous to reliability.
The test information function is the sum of the
item information functions, $I(\theta) = \sum_{i=1}^{n} I_i(\theta)$. It provides the test
information at each ability level.

7. Unidimensionality (Validity)

Unidimensionality of a test is found by factor analysis. It shows that
the test measures one major factor. Thus construct validity is
obtained through unidimensionality.

8. Domain : Compute the sum of item scores for each person, $Y = \sum_{i} U_i$
(for the Rasch Model the weight $w_i$ is equal to one for each item), so the
number-right score is the person score. There is a one-to-one
correspondence between the number-right score Y (person score)
and the person ability $\theta$. From the item calibration process we get the
person ability $\theta$ for each person score Y. Prepare a score
conversion table for the test to get the person ability $\theta$ (domain score)
from the person score Y.

4. Procedure for development and validation of ‘Mastery test’
by using Rasch Model.

1 to 4. As above.

5a. Determine a cut-off score $\pi_0$ (proportion of correct score at the
cut-off point).
Find $\theta_0$ from $\pi_0$ by using the equation

$$\pi_0 = \frac{1}{n} \sum_{i=1}^{n} P_i(\theta_0)$$

Calculate $P_i(\theta_0)$ for each item.

5b. Item selection : Calculate the item information function at $\theta_0$ for each
item. Select the items having maximum information at $\theta_0$.

5c. Calculate the cut-off score $Y_0 = \sum_{j=1}^{m} P_j(\theta_0)$
for the m selected items.

6 & 7. As above.

8. Master/Non-master decision

Calculate the person score $Y = \sum_{i=1}^{m} U_i$.
Assign each person, whose score is Y, as
$Y \geq Y_0$ : master, and
$Y < Y_0$ : non-master.
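
Steps 5a to 5c and step 8 of the mastery-test procedure can be sketched as follows. The function names, the item difficulties and the cut-off proportion are hypothetical; θ0 is found by bisection, which is one convenient way of solving the equation in step 5a and is not prescribed by the procedure itself.

```python
import math

def rasch_probability(theta, b):
    """P(correct) for ability theta and item difficulty b under the Rasch Model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information P(1 - P)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

def theta_at_cutoff(pi0, difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Step 5a: solve pi0 = (1/n) * sum of P_i(theta0) for theta0 by bisection."""
    n = len(difficulties)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_p = sum(rasch_probability(mid, b) for b in difficulties) / n
        if mean_p < pi0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def select_items(theta0, difficulties, m):
    """Step 5b: indices of the m items with the most information at theta0."""
    ranked = sorted(range(len(difficulties)),
                    key=lambda i: item_information(theta0, difficulties[i]),
                    reverse=True)
    return sorted(ranked[:m])

def cutoff_score(theta0, selected_difficulties):
    """Step 5c: Y0 = sum of P_j(theta0) over the selected items."""
    return sum(rasch_probability(theta0, b) for b in selected_difficulties)

def classify(responses, y0):
    """Step 8: master if the number-right score reaches Y0, otherwise non-master."""
    return "master" if sum(responses) >= y0 else "non-master"

pool = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical calibrated difficulties
theta0 = theta_at_cutoff(0.5, pool)
chosen = select_items(theta0, pool, 3)
y0 = cutoff_score(theta0, [pool[i] for i in chosen])
print(round(theta0, 4), chosen, round(y0, 4))
print(classify([1, 1, 0], y0), classify([1, 0, 0], y0))
```

With this symmetric pool and a cut-off proportion of one half, θ0 falls at the centre of the scale and the three items nearest to it are retained.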
Preparing a criterion-referenced test or mastery test by applying the
Rasch Model is as yet an untouched (new) research area in India.
An Algorithm for Graph-Theoretic
Measurement of Unidimensionality of
Criterion-referenced test items


Chandrakant Bhogayata 


GRAPH THEORY 


Graph theory is a branch of mathematics. In practical applications, a
point of a graph represents some object of substantive interest and an arc of
a graph suggests the existence of some relationship between the two objects
whose representative points are joined by the arc (Tatsuoka, 1986).
The main purpose of graph theory is to facilitate the grasping and
understanding of the relational structure represented by a complex graph through
simplifying the graph. It uses Euler's first theorem, and matrix and Boolean
algebra, as its mathematical tools for achieving its purpose. Graph-theoretic
applications do not require the explicit use of graphs; they require only
their algebraic representations as adjacency matrices.


After conducting a survey of the educational and psychological
research literature, Tatsuoka (1986) found three main applications of graph
theory: (1) measurement of unidimensionality of a set of items in an
educational or psychological test, or extraction of unidimensional subsets of
items from a multidimensional test; (2) determination of hierarchical
structures among test items; and (3) determination of hierarchical structures
among instructional content units.


UNIDIMENSIONALITY OF CRT ITEMS 


It is a well-known fact that Criterion-Referenced Test (CRT) items
are generated from operationally well-defined behavioral or content
domains. Criterion-referenced interpretation of test scores depends upon the
degree of specification of the content domains. Unidimensionality is
assumed among the items generated from a well-defined content domain of a
CRT. This unidimensionality assumption must be tested empirically. The
construct validation of CRT items may be another name for the empirical
test of the unidimensionality assumption.

Recently, Roid (1984, p. 66) suggested conducting a
multidimensional analysis of CRT items for their construct validation as a
new research area in Criterion-Referenced Measurement (CRM). Messick
(1975) emphasized the construct validation of both CR and NR
achievement tests.


GRAPH THEORY AND FACTOR ANALYSIS 

Traditionally, factor analysis is the statistical tool for construct
validation or empirical testing of the unidimensionality assumption. Reynolds
(1981) showed that a graph-theoretic computer program, ERGO, outperformed
traditional factor analysis in isolating the correct number of unidimensional
chains as well as in yielding a reasonable ordering of items within chains. Wise
and Tatsuoka (1986) also demonstrated that the modified graph-theoretic
assessment of the dimensionality of dichotomous data was highly congruent
with the results of factor analysis. Application of graph theory to
dimensionality problems is relatively less complicated than the application of
factor analysis. So there seems to be little reason for educational
researchers not to take advantage of graph theory in their work. Graph
theory can be applied to the measurement of unidimensionality of CRT
items.


This paper presents an algorithm for the measurement of 
unidimensionality of CRT items of a behavioural or content domain. A 
hypothetical example is also given to facilitate the application of this 
algorithm. The algorithm is constructed after Tatsuoka's (1986) paper on 
graph theory. 


AN ALGORITHM FOR THE MEASUREMENT OF UNIDIMENSIONALITY

1. Select one or more instructional objectives, domains or units for
which a CRT is to be constructed.

2. Write items for each domain using any one of the operationally
defined item generation technologies such as item forms, mapping
sentences, linguistics-based transformations of prose materials,
concept analyses etc.

3. Administer the items to a representative group of examinees of an
appropriate size.

4. Score the items dichotomously.

5. Prepare a binary score matrix $S$. Represent persons and items by
rows and columns, respectively. Denote $s_{ij} = 1$ when person i gets
item j right and 0 otherwise.

6. Rearrange the items in matrix $S_a$ from difficult to easy, as judged
from the column sums of $S$.

7. Prepare an adjacency supermatrix $A$. Denote the complement of
$S_a$ by $\bar{S}_a$, with elements $\bar{s}_{ij} = 1 - s_{ij}$ (i.e. a "wrongs"
score matrix). Now arrange $S_a$ and the transpose $\bar{S}_a'$ of $\bar{S}_a$
along with two null matrices, so that $A$ has four submatrices:
$$A = \begin{bmatrix} 0 & \bar{S}_a' \\ S_a & 0 \end{bmatrix}$$

8. Get the squared adjacency supermatrix $A^2$ by squaring $A$.

9. Calculate Cliff's consistency index $C_{t1}$ as the magnitude of
unidimensionality from the upper-left submatrix (item-dominance
matrix) $N$ of $A^2$ using the formula
$$C_{t1} = \frac{\sum_{j<k} n_{jk} - \sum_{j>k} n_{jk}}{\sum_{j,k} n_{jk}}$$
where the three sums are over the above-diagonal, below-diagonal and
total elements of $N$.
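Steps 5 to 9 can be sketched in plain Python. The sketch assumes the supermatrix is arranged with items first, so that its upper-right block is the transposed "wrongs" matrix and its lower-left block is the ordered score matrix; the function names and the response data are hypothetical and chosen only to illustrate the calculation.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def consistency_index(scores):
    """Return (above, below, index) for a persons-by-items 0/1 matrix."""
    n_persons, n_items = len(scores), len(scores[0])
    # Step 6: order items from difficult to easy (ascending column sums).
    col_sums = [sum(row[j] for row in scores) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: col_sums[j])
    s_a = [[row[j] for j in order] for row in scores]
    # Step 7: "wrongs" matrix and supermatrix A (assumed layout: items first).
    s_bar_t = transpose([[1 - v for v in row] for row in s_a])
    size = n_items + n_persons
    a = [[0] * size for _ in range(size)]
    for j in range(n_items):
        for i in range(n_persons):
            a[j][n_items + i] = s_bar_t[j][i]  # upper-right block
            a[n_items + i][j] = s_a[i][j]      # lower-left block
    # Step 8: square A; step 9: take the upper-left item-dominance matrix N.
    a2 = matmul(a, a)
    n = [row[:n_items] for row in a2[:n_items]]
    above = sum(n[j][k] for j in range(n_items) for k in range(j + 1, n_items))
    below = sum(n[j][k] for j in range(n_items) for k in range(j))
    total = above + below  # the diagonal of N is always zero
    index = (above - below) / total if total else None
    return above, below, index


# Hypothetical 5 persons x 4 items data forming a perfect Guttman pattern.
responses = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(consistency_index(responses))
```

Each above-diagonal entry of N counts persons who failed a harder item but passed an easier one (a consistent pattern), and each below-diagonal entry counts the reverse, so a perfectly ordered data set gives an index of 1 and a data set with as many reversals as consistent pairs gives 0.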


A HYPOTHETICAL EXAMPLE 


Suppose that a CRT for the concept of the Sanskrit past participle
(Karmani Bhoot Kridant) is to be constructed. Fifty items are written for each
of the seven domains of this concept by using Tiemann and Markle's
concept analysis method. After administering and scoring the items, the
data are ready for application of the fifth and successive steps of the
graph-theoretic algorithm. In actual practice, a computer program is
necessary for squaring adjacency matrices of a large number of persons and
items. But for simplicity and brevity, only a 5 persons (rows) x 4 items
(columns) binary matrix is presented here:

5. Prepare a binary score matrix $S$.

6. Rearrange the items in matrix $S_a$ from difficult to easy.

7. Prepare an adjacency supermatrix $A$.

8. Get the squared adjacency supermatrix $A^2$.

9. Calculate Cliff's consistency index $C_{t1}$ from the upper-left
submatrix (item-dominance matrix) $N$ of $A^2$:
$$C_{t1} = \frac{10 - 0}{10} = 1.0$$

In the present hypothetical example, the magnitude of the
unidimensionality of the items of one domain is 1. Hence, the items of this
domain are perfectly unidimensional.


A minimum acceptable standard of the magnitude of unidimensionality
should be established for items which are not perfectly
unidimensional.


CONCLUDING REMARKS 
The algorithm presented in this paper may also be useful in resolving
the following problems in criterion-referenced measurement:
1. Empirical establishment of item-objective congruence in the form
of criterial unidimensionality.
2. Determination of the degree of domain specification.
3. Comparison of two or more item-writing technologies for their
unidimensional item-generation power.
Evaluating The Quality Of
Criterion-Referenced Test Items 


Ved Prakash 
ABSTRACT 


The paper deals with the quality of criterion-referenced test
items. Rating by content specialists is discussed to highlight
the need for judgemental validity. The empirical approach for
judging the quality of test items is discussed with reference to
the facility value and discrimination index. Techniques like the
upper and lower group index, pre-test and post-test difference
index, chi-square index and masters-nonmasters index are
described. The concept of facility index and discrimination
index as used in criterion-referenced tests is differentiated from
that in norm-referenced tests. The quality of a criterion-referenced
test in terms of these two parameters carries an altogether
different connotation, usually referenced to teaching
effectiveness.


1. INTRODUCTION : 

The strength of any test lies in the quality of its individual items,
and the quality of an item depends upon the quality of its stem, key and
distracters. There are certain criteria which one should always strive
for to bring in quality with regard to each one of these components.
Once item writers are made familiar with them, they may generate better
items than novices. Virtually no one wants to write bad or poor items,
yet bad items are sometimes born. There have been many instances where
professional item writers misjudged the quality of their items. This is
not strange, as there is nothing like perfection in item writing. In fact,
in the process of item writing there is no knowing how the item may
behave when it is tried out in the field. Thus, item writers may only make
some predictions about the functionality of their items with regard to a
specific sample, and these may or may not hold. Since we have to live with
all these constraints, we have to have some checks and counterchecks to
make the system of selection as scientific and rational as possible.

A large number of methods have been employed over the years for
identifying quality items in connection with norm-referenced tests.
In all these methods we look for two item parameters, namely the
facility value and the discrimination index. These parameters are used as
aids for revision, rejection and selection of test items. There is a similar
need to develop such methods for the identification of good and poor
items with regard to criterion-referenced tests too. Obviously the
methods and procedures employed for criterion-referenced tests are
going to be different from the ones used for norm-referenced tests, as they
have altogether different purposes to serve.

2. RATING BY CONTENT SPECIALISTS : 

This approach of judging the quality of test items is based on
human judgement, wherein one looks at whether the items are
consistent with the test's specification. Since there is a human element
involved in this type of rating, there is always a danger of subjectivity.
However, this can be minimised if we create a structured situation in
which qualified experts judge all test items with regard to their
consistency with the test specifications. Rovinelli and Hambleton (1976) in
their study recommended the use of content specialists in judging the
quality of test items. They are of the opinion that content specialists can
complete their ratings not only quickly but also with a high degree of
reliability and validity. Coulson and Hambleton (1974) in another study
showed that content specialists' ratings along with empirical procedures
provide an excellent basis for establishing the content validity of
domain-referenced test items.


This approach depends more on local conditions such as the
number of available judges, the number of test items, and so on. This
exercise may be completed in the following steps:

(i) Identify four or five content experts who are willing to act as
judges.

(ii) Provide test specifications for each set of items they will be
judging. By test specification we mean the different
competencies measured by different items or by sets of items.

(iii) Ask them to read the test specification and then to review each
item's congruence with its set of specifications and provide
ratings and remarks.

In this process of validation one should feel satisfied if the item
is completely congruent with the specification. But if an item is found
incongruent with any of the stipulations set forth in the test
specifications, then the judges must label it as incongruent and give
their remarks as to why it has been so labelled. If two or
more judges identify the same item as incongruent, and for the same
reason, then there is sufficient reason to reconsider that item against its
specification.


Whenever there is a disagreement between judges regarding the
congruence of an item, a review period should be fixed in which the
judges concerned, or all judges, may meet to discuss and arrive
at a consensus. If there is no agreement between the judges during the
review period, more weightage should be given to the rating of an
alert judge. The alertness of the judges can be found out by intentionally
introducing some incongruent items in the test. The attentive and alert
judges will spot such items and identify their inconsistency with the
specifications while others will not. The remarks and ratings of a judge
who fails to spot such items may be ignored. The judges should not be
told in advance about the inclusion of incongruent items. Once this
exercise is over we may identify good and poor items on the basis of the
ratings of the experts. Accordingly, we may decide which items should
be revised, selected or rejected.


3. EMPIRICAL APPROACH

This approach of judging the quality of test items is based on
empirical evidence, in which one looks for item parameters like
facility value and discrimination index. As with norm-referenced tests, we
may compute these indices for criterion-referenced tests too.


Before we proceed further we must remember that in
norm-referenced tests, statistics like facility value and discrimination
index are based on correlational analysis. As a result, they are
dependent upon a reasonable degree of variability in examinees'
responses. If examinees' responses are not spread over a wide
continuum, the correlational approaches do not work. In
criterion-referenced tests the situation is altogether different, as there the
emphasis is on bridging the gap between the masters and the
non-masters, which ultimately decreases the response variance of the
examinees. If we further improve our instruction, the response variance
will naturally be lower still. Because of this discrepancy,
the facility value and discrimination index of criterion-referenced tests
do not carry their conventional meanings.


3.1 Facility Value :
The facility value of an item is a simple statistic which indicates
how easy or difficult an item has proved to be, and is determined by
calculating the percentage of examinees who answered it correctly. The
higher the percentage, the easier the item and vice-versa. It is usually
calculated in the form of a percentage:
$$FV = \frac{R}{N} \times 100$$
Where R = Number of examinees who answered it rightly.
N = Total number of examinees attempting the item.

What facility value of an item is to be regarded as acceptable
will depend, of course, on the kind of test one is trying to construct. The
facility value ranges from zero to one hundred. Since in
criterion-referenced tests we wish to test the basic
skills which are expected to be mastered by a high percentage of the
examinees, a higher facility value is always desirable, unlike in
norm-referenced tests. While in a norm-referenced test anything between
30 and 75 is considered desirable, in a criterion-referenced test any value
above 80 is considered most acceptable. The facility value in a
criterion-referenced test may go even up to 100 and still be
considered fairly good, as it indirectly signifies teacher effectiveness.
This is quite simple to understand. All the candidates will be able to
answer an item correctly if the concept tested by that item is completely
mastered by all of them, and this is possible only if the instruction is
effectively imparted to them. Thus items with facility values between
80 and 100 may be accepted as good items in criterion-referenced
tests.

As regards the facility value of criterion-referenced test items,
one thing should never be forgotten: a low facility value of an item
is not always because of the poor quality of the item; sometimes it may
also be because of ineffective teaching. Therefore, the facility value of
an item should not be taken as the sole criterion for the selection or
rejection of the item.
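As a small illustration, the facility value and the 80 to 100 acceptance band for criterion-referenced items can be written as below. The function names and the example counts are hypothetical, and the band is exposed as a parameter because it is a rule of thumb rather than a fixed standard.

```python
def facility_value(num_correct, num_attempted):
    """Percentage of examinees attempting the item who answered it correctly."""
    if num_attempted <= 0:
        raise ValueError("at least one examinee must attempt the item")
    return 100.0 * num_correct / num_attempted


def in_crt_band(fv, lower=80.0, upper=100.0):
    """True when the facility value lies in the band suggested for CRT items."""
    return lower <= fv <= upper


# Hypothetical counts: 45 of 50 examinees answered the item correctly.
fv = facility_value(45, 50)
print(fv, in_crt_band(fv))
```

A value outside the band is only a flag for review, since a low facility value may reflect ineffective teaching rather than a poor item.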


3.2 Discrimination Index:

This is another numerical index which is widely used for the
identification of the quality of items. This statistic shows the degree to
which a particular item discriminates between the bright and poor
students. There are various methods suggested for the estimation of the
discrimination index of criterion-referenced test items.


3.2.1 Upper and Lower Group Index :

This discrimination index may be estimated by subtracting the
pass percentage of the lower group from that of the upper group. The
groups may be formed on the basis of the upper 27% and lower 27%,
or upper 33% and lower 33%, or upper 50% and lower 50%, depending on the
available sample size. Efforts must be made to make the two
groups as distinct as possible to get a better index, and also as
large as possible to reduce the sampling error.
$$D = \frac{R_u}{N} - \frac{R_l}{N}$$
Where Ru = Number of right responses in the upper group
Rl = Number of right responses in the lower group
N = Number of candidates in the group.

For instance, let us say that in the upper group an item is
correctly answered by .95 of the examinees, but in the lower group only
.62 of the examinees answer the item correctly. The difference, .33, is the
item discrimination index.
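A minimal sketch of the upper and lower group index follows. It assumes equal-sized groups formed by ranking examinees on total score, with the group fraction as a parameter; the function names and the data are hypothetical.

```python
def split_groups(total_scores, fraction=0.27):
    """Return index lists for the upper and lower groups by total score."""
    size = max(1, round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    return ranked[:size], ranked[-size:]


def upper_lower_index(item_scores, total_scores, fraction=0.27):
    """Proportion right in the upper group minus that in the lower group."""
    upper, lower = split_groups(total_scores, fraction)
    right_upper = sum(item_scores[i] for i in upper)
    right_lower = sum(item_scores[i] for i in lower)
    return (right_upper - right_lower) / len(upper)


# Hypothetical data: total test scores and 0/1 scores on one item.
totals = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
item = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
print(round(upper_lower_index(item, totals, fraction=0.3), 3))
```

With the proportions quoted in the text, .95 and .62, the same subtraction gives .33.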


3.2.2 Pretest-Posttest Difference Index : 

Cox and Vargas (1966) suggested estimating the
discrimination index by computing the percentage of students who pass
an item on the posttest minus the percentage who pass the item on the
pretest.
$$D = \frac{R_{po}}{N} - \frac{R_{pr}}{N}$$
Where Rpo = Number of rights in the posttest
Rpr = Number of rights in the pretest
N = Number of candidates

When items are ranked on the basis of such an index, the
rankings correlate only modestly with rankings based on more
conventional indices. This was supported by Cox and Vargas in their
study, in which they estimated the conventional discrimination index by
calculating the percentage of students in the highest 27% in total posttest
scores who pass the item minus the percentage in the lowest 27% who pass
the item.
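The pretest-posttest difference index is a one-line calculation. The sketch below assumes the same group of candidates takes the item on both occasions and that responses are coded 0/1; the function name and the data are hypothetical.

```python
def pre_post_difference_index(pretest, posttest):
    """Proportion passing on the posttest minus proportion passing on the pretest."""
    if not pretest or len(pretest) != len(posttest):
        raise ValueError("pretest and posttest need the same non-zero length")
    return (sum(posttest) - sum(pretest)) / len(pretest)


# Hypothetical 0/1 responses of five candidates to one item.
pre = [0, 0, 1, 0, 0]
post = [1, 1, 1, 0, 1]
print(pre_post_difference_index(pre, post))
```

Values near 1 indicate an item that few candidates passed before instruction and most passed after it; values near 0 or below suggest the item is insensitive to the instruction given.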


3.2.3 Chi-Square Index : 


This method was suggested by Popham (1978) on the basis of
pretest and posttest results. It takes into account the effectiveness
of the instruction imparted to the candidates. A fourfold table is
prepared in which the four possible pretest-posttest results for a test
item are depicted according to whether the item is answered correctly or
incorrectly. For any student, an item answered incorrectly (0) on
the pretest may be answered correctly (1) or incorrectly (0) on the posttest.
For some students an item may be answered correctly (1) on the pretest
and also correctly (1) on the posttest. Such frequencies may be inserted
in the following table.

TABLE 1
Pretest and posttest results

The discrimination index may be calculated for individual items with
the help of the following formula:
$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$
where a, b, c and d are the four cell frequencies of the table and
N = a + b + c + d. The chi-square value thus computed may be tested for
significance. Items with significant values may be accepted as good
items.
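The chi-square calculation for a fourfold table can be sketched as follows. The cell labels a, b, c, d follow the usual row-by-row layout of a 2 x 2 table, which is an assumption here, and the significance check uses the tabled critical value of 3.841 for one degree of freedom at the 5% level; the function names are illustrative.

```python
CRITICAL_VALUE_05 = 3.841  # chi-square, 1 degree of freedom, 5% level


def chi_square_fourfold(a, b, c, d):
    """Chi-square for a 2 x 2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero; chi-square is undefined")
    return n * (a * d - b * c) ** 2 / denominator


def is_significant(chi_square, critical=CRITICAL_VALUE_05):
    """True when the computed value exceeds the critical value."""
    return chi_square > critical


# Hypothetical cell frequencies for one item.
value = chi_square_fourfold(10, 20, 30, 40)
print(round(value, 3), is_significant(value))
```

The statistic is undefined when any row or column total is zero, for example when every candidate answers the item correctly on the posttest, so such items need a separate judgement.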


3.2.4 Masters-Nonmasters Index :

Criterion-referenced tests are not expected to discriminate
among all levels of competence but only between masters (who pass)
and non-masters (who fail). Therefore, instead of taking the
performance of the top 27% and bottom 27% into account, we may
compare the performance of masters and non-masters by putting all
failures (non-masters) in the lower group and those who pass (masters)
in the upper group. We may prepare and use the following table for this,
as shown by Singh (1983).


The discrimination index may be computed with the help of the following
formula:
$$D = \frac{R_p}{n_p} - \frac{R_f}{n_f}$$
Where Rp = Number of examinees who passed the total test and
answered the item correctly
Rf = Number of examinees who failed the total test and
answered the item correctly
np = Number of examinees who passed the total test
nf = Number of examinees who failed the total test
Table 2

                    Total Test
Item            Passed      Failed
Correct           30           6
Total
In this way we may compute the discrimination index for individual
items. While computing the discrimination index we should always
remember that the discrimination index of an item is subject to
considerable sampling error. The smaller the sample used in the
analysis, the larger the sampling error. An item that appears to be
highly discriminating in one small sample may appear to be weakly
discriminating in another sample. Therefore, like facility value, this index
is also group dependent. To overcome this problem it is necessary to
establish the discrimination index of an item by administering it to a
large sample.


Since criterion-referenced tests are not expected to discriminate
among all levels of competence, one need not strive for higher
discrimination indices in criterion-referenced tests, unlike
norm-referenced tests. Therefore, items having discrimination indices
between zero and 0.15 may be accepted as good items, while items
with negative discrimination indices may be dropped.

For further reference it may also be mentioned that besides
these methods, two more methods, namely the Bayesian method based on
Bayes' theorem and the Rasch method based on a unidimensional approach,
have also been proposed for the estimation of discrimination indices.
Establishing Reliability Of Criterion
Referenced Tests


K. Ramachandrachar D. Brahadeeswaran


ABSTRACT 


There are many varieties of criterion-referenced tests,
viz., domain referenced tests, objectives referenced tests
and mastery tests. A number of different approaches for
estimating the reliability of criterion referenced tests (CRTs)
have recently been proposed in an attempt to cope with the
possible lack of score variability that attenuates traditional
coefficients. Three major categories of reliability appropriate to
criterion referenced tests are: reliability of criterion
referenced test scores, reliability of domain score estimates
and reliability of mastery classification decisions. Depending
on the purpose for which a CRT is used, an appropriate
reliability index has to be estimated. This paper describes
procedures that can be easily used by teachers for
estimating the reliability of CRTs on the basis of the data obtained
from a single administration of the test.


1.0 INTRODUCTION


The traditional norm-referenced tests are developed to obtain 
test scores suitable for ranking learners on the ability measured by the 
test. Criterion referenced tests are generally developed to obtain test 
scores suitable for evaluating the student's mastery of the instructional 
objectives covered in the test, so as to place him appropriately in the 
next instructional unit. 


Since criterion referenced tests are often used in conjunction
with instructional programmes that maximise the number of students
attaining the highest mastery states and minimise the variability of test
scores, the classical correlation between scores on parallel tests, or the
ratio of true to observed variance, may be attenuated by lack of
variability and is thus unsatisfactory as an index of the reliability of a
criterion referenced test. For this reason many researchers have
developed new approaches for estimating the reliability of criterion
referenced tests.

This paper focusses on the procedures which can be used
easily by teachers for estimating the reliability of criterion referenced
tests.


2.0 MANY VARIETIES OF CRITERION REFERENCED TESTS : 

A recent content analysis (Gray, 1978) of 57 descriptions of
criterion referencing revealed that it was not unusual for different
authors to use the term differently.


A criterion referenced test was defined by Ivens (1970, p. 2) as
one consisting of 'items keyed to a set of behavioural objectives' (cited in
Glaser and Nitko, 1971). Nitko (1980) defined CRTs as those
'deliberately constructed so as to yield scores directly interpretable in
terms of specified performance standards'. Harris and Stewart (1971)
gave a much narrower definition. A pure CRT for them consisted of 'a
sample of production tasks drawn from a well defined population of
performances ...'. Millman (1974) preferred to term the Harris-Stewart
definition as giving a 'domain-referenced test' and the Ivens definition as
giving an 'objective based test'. A CRT for Popham (1975) was the test
'used to ascertain an individual's status with respect to a well defined
behaviour domain'.


The term "criterion" in the phrase "criterion referenced test"
refers to a behaviour domain.


A major confusion in the definition of criterion referenced test is
over the word 'criterion'. The confusion prevailed until a distinction
between two conceptions of criterion was made by Popham (1981, p.
27). The distinction is between 'criterion as a level' and 'criterion as a
desired behaviour'. The word 'criterion', when used to refer to a desired
level, stands for a level conception. When it is used to refer to the target
behaviours themselves, it stands for a desired behaviour conception.
However, according to Popham, interpreting criterion as a level of
examinee proficiency has almost no advantages over traditional testing
practices. In fact, such a meaning unwittingly prompts one to look upon
any norm referenced test as a criterion referenced test. It looks as
though it requires only setting a level of proficiency for the test to
designate it as a CRT. Therefore it should be noted that CRTs require
desired behaviour as the criterion if they are to provide a precise
description of an examinee's status.


2.1 Domain Referenced Tests :
Popham's (1975) definition of criterion referenced test presented
in section 2.0 is equivalent to the definition of domain-referenced test
proposed by Hively, Maxwell, Rabehl, Sension and Lundin (1973). In
their approach also, an examinee's performance is referenced to a well
defined domain of learner behaviours. The term 'domain' refers to a
small class of behaviours. It includes both (a) a specific content area
and (b) the behaviours associated with this content.


For a variety of reasons, however, testing specialists (e.g.
Popham, 1978; Hambleton, Swaminathan, Algina and Coulson, 1978)
have opted for the expression criterion-referenced measurement over
domain-referenced measurement.


2.2 Objectives-Referenced Tests : 


Objectives-referenced tests are those whose items have been
constructed to measure an instructional objective. Usually such
objectives are formulated behaviourally.


As observed by Nitko (1983, p. 456), objectives-referenced
tests may or may not be criterion-referenced. If objectives are written
to describe a domain, and if items are then written to sample the
behaviours in this domain, then this would fit the description of
criterion-referenced tests. If objectives are stated very briefly they may
not provide a clear description of the behavioural domain. Poorly
articulated behavioural objectives characterize what are called
"ill-defined domains" (Nitko, 1980, p. 466) and are not considered to lead
to appropriate criterion referenced interpretation.


2.3 Mastery Tests :

According to Nitko (1983, p. 457), a test used to provide
information to make a decision about whether a particular student has
"mastered" a given instructional goal is called a mastery test. The term
'mastery' is likely to mean different things in different contexts. Glaser
and Nitko (1971, p. 64) have stated that 'mastery' in the instructional
context implies that "an examinee makes a sufficient number of correct
responses on the sample of test items presented to him in order to
support the generalization (from this sample of items to the domain or
universe of items implied by an instructional objective) that he has
attained the desired pre-specified degree of proficiency with respect to
the domain".


This definition of mastery is closely associated with the idea of
criterion-referencing. However, a mastery test need not be a criterion
referenced test, although answering the question "mastery of what?"
would be difficult without linking the answer to a well-defined
performance domain. Given the present state of criterion-referencing,
the "mastery of what" question seems to be best answered by clearly
specifying the domain of instructionally relevant behaviours which a
learner commands.


3.0 THE ISSUE OF CRT RELIABILITY : 

If student performance is to be referred to a criterion of
performance, resulting in a classification into masters and non-masters,
the practice is to shape instruction so as to maximise the number of
masters resulting from the instructional programme. This practice
obviously has the effect of minimising the variability of test scores obtained
from a criterion referenced test. Thus the classical correlation coefficient,
which depends on normal variability of test scores and is employed to
indicate association between parallel/equivalent test performances, is an
unsatisfactory indicator of the reliability of CRTs (Subkoviak, 1976). Further,
the notion of error variance is not acceptable in the case of CRTs.


The reliability of a norm-referenced test ($r_{xx}$) is defined as the ratio
of the true score variance to the observed score variance. This is
expressed in the following form:
$$r_{xx} = \frac{S_t^2}{S_x^2}$$
where $S_t^2$ = true score variance (i.e. the variance attributable to
all the true variables supposedly operating in the test) and $S_x^2$ =
observed score variance (i.e. the total variance, including that which is
attributable to errors in measurement and chance). Rewriting the
above formula using $S_t^2 = S_x^2 - S_e^2$, where $S_e^2$ is the error
variance, we get
$$r_{xx} = 1 - \frac{S_e^2}{S_x^2}$$

Norm referenced test reliability is often expressed in this
manner. The basic definitional formula can also be re-written as:
$$r_{xx} = \frac{S_t^2}{S_x^2} = \frac{S_t^2}{S_t^2 + S_e^2}$$

In criterion referenced interpretation, where most students are
shaped to reach the criterion by means of suitable instruction and
provision of individually required time, the variability of the true scores is
reduced. Thus $r_{xx}$ tends to zero. When $S_e^2$ also tends to zero, the
ratio $r_{xx}$ remains undefined. Thus the classical estimates of
reliability derived to represent the basic definition of $r_{xx}$ are not suitable
in a criterion referenced context. Hence different mathematical
formulations have been proposed for finding the reliability of CRTs.
Hambleton et al (1978) identified three major categories of
reliability appropriate to criterion referenced tests.
(a) Reliability of criterion referenced test scores :
This type of reliability is appropriate when we want to decide
whether the squared deviations of individual scores from the
cut-off score are consistent or not.
(b) Reliability of domain score estimates :
When our focus is on estimating an examinee's domain score
from the examinee's test score on a set of test items
measuring an objective, the reliability of domain score estimates
has to be reported.
(c) Reliability of mastery classification decisions :
When a criterion referenced test is used to classify students
into masters and non-masters, the index of reliability reported
for such a test should reflect the degree to which students are
consistently assigned to the same mastery states across
parallel test administrations.


4.0 RELIABILITY OF CRITERION REFERENCED TEST SCORES : 

When reliability is defined as the ratio of true variance to
observed variance (Lovett, 1977), it can be seen that the reliability
coefficient is a measure of the amount of observed variance attributable
to deviations in individual performance from some point C on the score
scale. In the case of norm referenced tests the deviations are taken
from the population mean. Livingston (1972) suggested, in the case of
criterion referenced tests, measuring score deviations from the cut-off
score instead of from the average of all scores. The further the group
mean domain score is from the cut-off score, the more reliable the scores
are said to be.


4.1 Livingston's Method of Estimating Reliability of CRTs :

Livingston (1972) defines observed and true variance, in the
case of CRTs, as the expected squared deviation of the respective score
from criterion C. Then, by analogy to NRT reliability, CRT reliability is
defined as

$$ R_{cc} = \frac{E(D_{Ti}^2)}{E(D_{Xi}^2)} = \frac{E(D_{Ti}^2)}{E(D_{Ti}^2) + e_i^2} $$

where E stands for 'expected value of', $D_{Xi}$ is the deviation of
the observed score X of the ith person from C, $D_{Ti}$ is the deviation of the
true score T of the ith person from C and $e_i^2$ is the expected squared
deviation of the observed score from the corresponding true score. An
estimate of $R_{cc}$ (i.e., $r_{cc}$) can be got by first finding reliability as though
it were a norm referenced test and then adjusting it to the CR situation
using the formula

$$ r_{cc} = \frac{r_{xx} S_x^2 + (\bar{X} - C)^2}{S_x^2 + (\bar{X} - C)^2} $$

where $r_{xx}$ is a norm referenced estimate of the reliability of the
test, $S_x^2$ is an estimate of the observed variance $\sigma_x^2$ around the population
mean, and $\bar{X}$ is an estimate of $\mu$, the test mean. Lovett (1978) suggests that,
for a K item CRT, the norm referenced reliability can be estimated using
Kuder Richardson formula 20 or 21. This can then be substituted for
$r_{xx}$ in the formula for $r_{cc}$.


Livingston's procedure gives a determinate result even in the
case where all testees have obtained the same score, so that score
variance is zero; in this case the application of classical test theory in 
estimating the test reliability would have led to an indeterminate result. 
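
The adjustment from a norm referenced estimate to Livingston's coefficient can be sketched in a few lines. This is only an illustrative sketch: the function name `livingston_rcc` and its argument names are hypothetical, and the norm referenced estimate (for example KR 20 or KR 21) is assumed to have been computed beforehand.

```python
def livingston_rcc(r_xx, var_x, mean_x, cutoff):
    """Livingston's criterion referenced reliability from a norm referenced estimate.

    r_xx   : norm referenced reliability estimate (e.g. KR 20 or KR 21)
    var_x  : observed score variance
    mean_x : observed test mean
    cutoff : criterion (cut-off) score C
    """
    shift = (mean_x - cutoff) ** 2      # squared distance of the mean from C
    denominator = var_x + shift
    if denominator == 0:
        # Zero variance AND mean equal to C: the ratio is 0/0.
        raise ValueError("r_cc is undefined when variance is zero and mean equals C")
    return (r_xx * var_x + shift) / denominator


# Mean on the cut-off: the coefficient reduces to the norm referenced value.
print(livingston_rcc(0.6, 4.0, 7.0, 7.0))
# Mean two points above the cut-off: the coefficient rises.
print(livingston_rcc(0.6, 4.0, 9.0, 7.0))
# All testees share one score (variance zero) away from C: still determinate.
print(livingston_rcc(0.0, 0.0, 8.0, 6.0))
```

In the zero variance case the value passed for `r_xx` is immaterial, since it is multiplied by a variance of zero.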


However, Livingston's procedure is not without its critics. Harris
(1972) points out that the standard error of measurement (which, when
all is said and done, is the key statistic in reliability estimation) comes
out the same whether derived by application of straightforward
classical theory or by Livingston's modification of it. As observed by
Hambleton et al (1978, p. 18), this is one reason for not rejecting all
concepts from classical test theory with criterion referenced tests. The
fact is that the standard error of measurement is one method of setting 
up confidence bands around domain score estimates, albeit a 
conservative method. However, this particular point does not detract 
from Livingston's formulation or from the usefulness of his statistic. 


4.2 Lovett's Anova Method of Estimating CRT Reliability : 

Defining the CRT reliability by analogy to norm referenced test
theory as ‘the proportion of observed variance attributable to true 
variance’ and extending it to the mean of parallel measures, Lovett 
(1977) describes a generalised ANOVA procedure for estimating the 
reliability of criterion referenced tests from single administration data. In 
the case of CRT's, the deviations are taken from criterion C which is 
fixed without regard to the group mean (Livingston, 1972). The test 
situation consists of the K parallel measurements effected from n 
individuals. That is to say that there are K parallel CRTs administered to 
n individuals. First applying Spearman Brown's prophecy formula in the 
case of CRT’s as suggested by Livingston (1972), Lovett derives an 
expression for $R_{ck}$, the CRT reliability of the sum or mean scores of the
parallel measurements.

Applying ANOVA procedures, he further proves that

$$ R_{ck} = \frac{E(MS_p) - E(MS_e)}{E(MS_p)} $$

where E stands for 'expected value of'. He thereby arrives at $r_{ck}$, a
sample estimate of $R_{ck}$:

$$ r_{ck} = \frac{MS_p - MS_e}{MS_p} $$

where $MS_p = SS_p / n$, $MS_e = SS_e / [(k - 1)(n - 1)]$,
$SS_p$ = sum of squares among persons, and
$SS_e$ = residual or error sum of squares.

The ratio $F[n, (k - 1)(n - 1)] = MS_p / MS_e$ would serve as a test of the null
hypothesis $R_{ck} = 0$.
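
The sample estimate can be illustrated with a short computation. The sketch below rests on two assumptions that are not spelled out in the text: that the sum of squares among persons is taken about the fixed criterion C (which would account for its n degrees of freedom), and that the error sum of squares is the usual persons by measures residual. The function name `lovett_rck` is hypothetical.

```python
def lovett_rck(scores, cutoff):
    """Sketch of the ANOVA estimate r_ck for n persons by k parallel measures.

    scores : list of n rows, each holding one person's k parallel scores
    cutoff : criterion score C
    Returns (r_ck, F ratio).
    """
    n = len(scores)            # number of persons
    k = len(scores[0])         # number of parallel measures
    person_means = [sum(row) / k for row in scores]
    measure_means = [sum(row[j] for row in scores) / n for j in range(k)]
    grand_mean = sum(person_means) / n

    # Assumption: deviations of person means are taken from C, not the grand mean.
    ss_p = k * sum((m - cutoff) ** 2 for m in person_means)
    # Assumption: ordinary two-way residual (persons x measures interaction).
    ss_e = sum(
        (scores[i][j] - person_means[i] - measure_means[j] + grand_mean) ** 2
        for i in range(n)
        for j in range(k)
    )

    ms_p = ss_p / n
    ms_e = ss_e / ((k - 1) * (n - 1))
    if ms_p == 0:
        raise ValueError("MS_p is zero, so r_ck is undefined")
    r_ck = (ms_p - ms_e) / ms_p
    f_ratio = ms_p / ms_e if ms_e > 0 else float("inf")
    return r_ck, f_ratio


# Three persons, two parallel CRTs, criterion C = 7 (made-up scores).
r_ck, f_ratio = lovett_rck([[8, 6], [5, 7], [9, 9]], cutoff=7)
print(r_ck, f_ratio)
```

For these made-up scores MS_p = 10/3 and MS_e = 2, so the estimate is 0.4.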


5.0 RELIABILITY OF DOMAIN SCORE ESTIMATES :

Many methods are available for the estimation of examinee 
domain scores. In all these methods, we start with an examinee’s test 
score on a set of test items measuring an objective, and estimate the 
examinee's proportion-correct score if all of the items in the domain of
items measuring the objective had been administered. The score being 
estimated is referred to as an examinee's "domain score". (Hambleton 
et al 1978, p 5) 


When there is test score variance, it is possible to estimate the
standard error of measurement of a criterion referenced test. Whereas
reliability estimates for a test vary from one sample of examinees to
another, the standard error of measurement is generally invariant across
samples (Lord and Novick, 1968) and is therefore rather useful for
interpreting test scores, whether they be scores from a norm referenced
test or a criterion-referenced test. When strictly parallel tests are
available, well-known methods for estimating the standard error of 
measurement can be used. 


But in practice parallel forms of a criterion-referenced test are
often constructed by the random sampling of items from a "pool" of test
items keyed to an objective. Such tests are referred to as randomly or
nominally parallel tests, and typically do not meet the requirements for
strictly parallel tests. Randomly parallel tests are examples of the type of
measurements for which generalizability theory (Cronbach, Gleser,
Nanda & Rajaratnam, 1972) is applicable.

Cronbach et al. (1972, pp. 25-26) defined three different errors
of measurement. One error, $\Delta_i$, is appropriate when the
proportion-correct score is taken as an estimate of domain score. The
error $\varepsilon_i$ is appropriate when a linear regression estimate of domain score
is made, and the third error, $\delta_i$, is appropriate when an estimate of the
deviation between the ith examinee's domain score and the mean
domain score is made. Discussion on the second error will not be
presented here, because typically it is impossible to obtain a regression
estimate of domain score on the basis of a single randomly parallel test
(see Cronbach et al., pp. 140-146). Discussion on the third error will not
be presented here, because typically there is no reason to estimate the
deviation score with criterion-referenced tests.


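The error $\Delta_i$ is defined as the difference between the observed
proportion-correct score and the domain score for the ith examinee.
Suppose a domain of items exists. Let $x_{ij}$ be the score (0 or 1) for the ith
examinee on the jth item. Define $\Delta_{ij} = x_{ij} - \pi_i$, where $\pi_i$ is the
domain score of the ith examinee.

For an n-item test, the error of measurement $\Delta_i$ is $n^{-1} \sum_j \Delta_{ij}$.
Cronbach et al. (1972) discussed three variances for $\Delta_i$. These are
$\sigma^2(\Delta_i) = n^{-1} E_j \Delta_{ij}^2$, the error variance for examinee i on an n-item test
constructed by random sampling of items; $\sigma^2(\Delta) = E_i \sigma^2(\Delta_i)$, the average
over examinees of $\sigma^2(\Delta_i)$; and

$$ E_i \left[ n^{-1} \sum_j \Delta_{ij} - E_i \left( n^{-1} \sum_j \Delta_{ij} \right) \right]^2 , $$

the variance over examinees of $\Delta_i$ for a given test.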

To evaluate the accuracy of a particular test, it would be 
appropriate to estimate the third variance mentioned above. However, 
estimation of the quantity requires the administration of several 
randomly parallel forms, which may not be feasible.


Hambleton et al (1978, pp. 18-19) suggest that an estimate of $\sigma^2(\Delta)$ can
be obtained by laying out the item data as a one way ANOVA with
examinees as the factor. Item scores are considered to be replications
within a level of the examinee factor. The estimate is given by

$$ \hat{\sigma}^2(\Delta) = n^{-1} MS_{wp} $$

where $MS_{wp}$ is the within persons or replications mean square.
If several randomly parallel forms of n items each are available, then
the error variance can be estimated using the same formula. The proportion-correct
scores on the various forms are the replications within a level of the
examinee factor. For each examinee, $\hat{\sigma}^2(\Delta_i)$ can be calculated using the
formula

$$ \hat{\sigma}^2(\Delta_i) = \frac{N - n}{N} \cdot \frac{\sum_j (x_{ij} - \hat{\pi}_i)^2}{n(n - 1)} $$

where $\hat{\pi}_i$ is the observed proportion-correct score (estimated
domain score) for the ith examinee. The factor $(N - n)/N$ is used when the
domain is finite; N is the number of items in the domain. When n is small
relative to N, the estimate $\hat{\sigma}^2(\Delta_i)$ may be quite variable over random
samples of items.
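
A small numerical sketch may help to fix these two estimates. It assumes 0/1 item scores laid out with one row per examinee; the function names and the sample data are hypothetical, and the finite domain factor is applied only when a domain size is supplied.

```python
def within_person_mean_square(item_scores):
    """MS_wp: pooled within-examinee mean square of the item scores."""
    n_items = len(item_scores[0])
    total = 0.0
    for row in item_scores:
        p_hat = sum(row) / n_items                 # observed proportion correct
        total += sum((x - p_hat) ** 2 for x in row)
    return total / (len(item_scores) * (n_items - 1))


def group_error_variance(item_scores):
    """Estimate of the average error variance: MS_wp divided by test length n."""
    return within_person_mean_square(item_scores) / len(item_scores[0])


def examinee_error_variance(row, domain_size=None):
    """Estimate of the error variance for one examinee.

    domain_size : N, the number of items in a finite domain (optional).
    """
    n = len(row)
    p_hat = sum(row) / n
    variance = sum((x - p_hat) ** 2 for x in row) / (n * (n - 1))
    if domain_size is not None:
        variance *= (domain_size - n) / domain_size   # finite domain factor
    return variance


# Three examinees on a four-item test (made-up data).
data = [[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1]]
print(group_error_variance(data))
print([examinee_error_variance(row) for row in data])
print(examinee_error_variance(data[0], domain_size=8))
```

With equal test lengths and no finite domain factor, the group estimate equals the average of the individual estimates, which the example reproduces.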


Another approach for determining the accuracy of domain score
estimates was reported by Millman (1974) and Hambleton, Hutten &
Swaminathan (1976). They suggested that the standard error of
estimation derived from the binomial test model, given by the
expression $\sqrt{\hat{\pi}(1 - \hat{\pi})/n}$, could be used to set up confidence bands
around domain score estimates. This is a biased estimate, and an
unbiased estimate is obtained by substituting (n - 1) for n in the
expression. This is an expression for the standard deviation of errors of
measurement for an examinee with domain score $\hat{\pi}$ across
administrations of n item samples drawn at random from an item pool. A
correction factor can be introduced under the radical sign when the pool of test
items is finite. The advantages of this approach are that the estimate of
error is a function of domain score, is less conservative than the
estimates of error provided by the standard error of measurement, and
the effect of test length on the precision of estimates can be studied
easily. Further, the estimate is relatively easy to compute.
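
The binomial standard error and the resulting band can be sketched as follows. The function names are hypothetical, and the multiplier `z` used to widen the band is an assumption of this sketch rather than something prescribed in the text.

```python
import math


def binomial_se(p_hat, n, unbiased=True):
    """Standard error of a domain score estimate under the binomial model.

    p_hat    : observed proportion-correct score
    n        : number of items in the test
    unbiased : if True, (n - 1) replaces n in the denominator
    """
    denominator = n - 1 if unbiased else n
    return math.sqrt(p_hat * (1 - p_hat) / denominator)


def confidence_band(p_hat, n, z=1.0):
    """Band of z standard errors around p_hat, clipped to the range 0 to 1."""
    half_width = z * binomial_se(p_hat, n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# An examinee answering 8 of 10 items correctly.
print(binomial_se(0.8, 10, unbiased=False))
print(binomial_se(0.8, 10))
print(confidence_band(0.8, 10))
# Effect of test length on precision: the same score on a 37-item test.
print(binomial_se(0.8, 37))
```

Lengthening the test from 10 to 37 items halves the unbiased standard error for this examinee, which illustrates how the effect of test length can be studied directly.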


6.0 RELIABILITY OF MASTERY CLASSIFICATION DECISIONS : 

According to Millman (1974) "a test in which the range of
possible scores is partitioned into K non-overlapping intervals that
define different levels of student mastery of a well-specified content
domain is generally classified as criterion referenced".


Hambleton and Novick (1973) suggested that the reliability of
mastery classification decisions should be defined in terms of the
consistency of decisions from two administrations of the same test or
parallel forms of a test. If examinees are to be classified into m mastery
states, the index of reliability suggested by Hambleton and Novick
(1973) is

$$ P_o = \sum_{k=1}^{m} P_{kk} $$


where Pkk is the proportion of examinees classified in the kth 
mastery state on the two administrations. The index Po then is the 
observed proportion of decisions that are in agreement. The Po statistic 
is easy to calculate, but it suffers from at least one limitation. 


Swaminathan, Hambleton, and Algina (1974) argued that Po
does not take into account the proportion of agreement that occurs by
chance alone, and that therefore it could give a false impression to
users of the extent of mastery classification consistency. They
suggested using the Kappa coefficient, k (Cohen, 1960), as an index of
reliability. This coefficient is defined as

$$ k = \frac{P_o - P_c}{1 - P_c} $$

where

$$ P_c = \sum_{k=1}^{m} P_{k.} P_{.k} $$

The symbols $P_{k.}$ and $P_{.k}$ represent the proportions of examinees
assigned to mastery state k on the first and second administrations, 
respectively. The symbol Pc represents the proportion of agreement that 
would occur even if the classifications based on the two administrations 
were statistically independent. Thus, in a sense, it can be argued that k 
takes into account the composition of the group, and that in this sense it 
is more group independent than the simple proportion of agreement
statistic, Po. The upper limit of k is +1 and can occur only when the 
marginal proportions for different administrations are equal. The lower 
limit is close to -1. The precise lower limit of k is unimportant in the 
context of criterion-referenced testing, since any negative value 
indicates inconsistency and, therefore, unreliable decisions. 
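
Both indices can be computed directly from the mastery states assigned on two administrations. The following is a minimal sketch; the function name `agreement_indices` and the labels "M" and "N" are hypothetical, and the ten examinees are made-up data.

```python
from collections import Counter


def agreement_indices(first, second):
    """Return (P_o, P_c, kappa) for two lists of mastery states.

    first, second : state assigned to each examinee on the first and
                    second administration, in the same examinee order.
    """
    if len(first) != len(second) or not first:
        raise ValueError("both administrations must list the same examinees")
    n = len(first)
    # Observed proportion of consistent decisions.
    p_o = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal proportions of each state.
    marginal_first, marginal_second = Counter(first), Counter(second)
    p_c = sum(
        (marginal_first[state] / n) * (marginal_second[state] / n)
        for state in set(first) | set(second)
    )
    if p_c == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return p_o, p_c, (p_o - p_c) / (1 - p_c)


# Ten examinees: "M" = master, "N" = non-master.
first = ["M"] * 6 + ["N"] * 4
second = ["M", "M", "M", "M", "N", "N", "M", "N", "N", "N"]
print(agreement_indices(first, second))
```

Here seven of the ten decisions agree, but half of that agreement would be expected from the marginal proportions alone, so the coefficient is 0.4 rather than 0.7.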


The coefficient k is dependent upon all factors that affect the 
decision-making procedure ; the cut-off score, the heterogeneity of the 
group of examinees, and the method of assigning examinees to mastery 
states. Hence it is suggested that all of these factors be presented when 
reporting k, since this information would contribute to its proper
interpretation. 


The coefficients k and Po are defined in terms of repeated
testings. Huynh (1976) and Subkoviak (1976) have developed methods
of estimating k and Po, respectively, on the basis of the data obtained
from a single administration of the test. These two procedures are
described in the following sections.


6.1 Huynh's Method of Computing Coefficient of Agreement (Kappa) :

Huynh (1976) proposed a method for estimating the kappa
reliability index on the basis of a single test administration. This method
is particularly suitable when testing is intermingled with instruction in the
sense that test data provide a basis for the granting or denial of mastery
status on a set of objectives.


The procedure suggested by Huynh involves the following
steps.

Step 1 : The sample mean M and standard deviation S are
computed.

Step 2 : KR 21 reliability is computed by using the formula

$$ \alpha_{21} = \frac{n}{n - 1} \left[ 1 - \frac{M(n - M)}{n S^2} \right] $$

where n = number of items in one form of the test.

Step 3 : The beta density parameters $\alpha$ and $\beta$ are computed using the
formulas

$$ \alpha = \left( -1 + \frac{1}{\alpha_{21}} \right) M \qquad \text{and} \qquad \beta = -\alpha + \frac{n}{\alpha_{21}} - n $$

After step 3, depending on the size of C (criterion cut off score),
either the chain of steps 4a, 5a and 6a or the chain 4b, 5b and 6b has to be
followed.


If the cut off score C is small, the steps listed below are to be
followed.

Step 4a : The proportion

$$ P_0 = \sum_{x=0}^{C-1} f(x) $$

is computed, i.e. $P_0 = f(0) + \ldots + f(C - 1)$. Here

$$ f(0) = \prod_{i=1}^{n} \frac{n + \beta - i}{n + \alpha + \beta - i} $$

and inductively f(x) is calculated for x = 0, 1, ..., C - 1 using

$$ f(x + 1) = f(x) \, \frac{(n - x)(\alpha + x)}{(x + 1)(n + \beta - x - 1)} $$

Step 5a : The proportion

$$ P_{00} = \sum_{x, y = 0}^{C-1} f(x, y) $$

is computed, i.e. $P_{00} = f(0, 0) + f(0, 1) + \ldots + f(C - 1, C - 1)$. Here

$$ f(0, 0) = \prod_{i=1}^{2n} \frac{2n + \beta - i}{2n + \alpha + \beta - i} $$

and inductively f(x, y) for values of x, y from 0 up to C - 1 is
calculated using

$$ f(x + 1, y) = f(x, y) \, \frac{(n - x)(\alpha + x + y)}{(x + 1)(2n + \beta - x - y - 1)} $$

Step 6a : Kappa is computed using

$$ k = \frac{P_{00} - P_0^2}{P_0 - P_0^2} $$


If the cut-off score C is large, the steps listed below are to be
followed.

Step 4b : The proportion

$$ P_1 = \sum_{x=C}^{n} f(x) $$

is computed, i.e. $P_1 = f(n) + \ldots + f(C)$. Here

$$ f(n) = \prod_{i=1}^{n} \frac{n + \alpha - i}{n + \alpha + \beta - i} $$

and inductively f(x) is calculated for values of x = n, n - 1, ..., C + 1, C using

$$ f(x - 1) = f(x) \, \frac{x (n + \beta - x)}{(n - x + 1)(\alpha + x - 1)} $$

Step 5b : The proportion

$$ P_{11} = \sum_{x, y = C}^{n} f(x, y) $$

is computed, i.e. $P_{11} = f(n, n) + f(n, n - 1) + \ldots + f(n - 1, n) + \ldots + f(C, C)$. Here

$$ f(n, n) = f(n) \times \prod_{i=1}^{n} \frac{2n + \alpha - i}{2n + \alpha + \beta - i} $$

and inductively f(x, y) for values of x, y from n down to C is
calculated using

$$ f(x - 1, y) = f(x, y) \, \frac{x (2n + \beta - x - y)}{(n - x + 1)(\alpha + x + y - 1)} $$

Step 6b : Kappa is computed using

$$ k = \frac{P_{11} - P_1^2}{P_1 - P_1^2} $$


Computational labour is reduced considerably because the joint
density f(x, y) is assumed to be symmetric, and hence f(x, y) = f(y, x).
Thus f(0, 1) = f(1, 0), f(n, n - 1) = f(n - 1, n), etc.
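
The small cut-off chain of steps can be sketched in code as follows. This is an illustrative sketch under the beta-binomial model, not a reproduction of Huynh's own worksheet: the function names are hypothetical, the recurrences are the standard beta-binomial ones, and the summary statistics in the example are made up. The symmetry f(x, y) = f(y, x) is used to start each row of the joint table.

```python
import math


def kr21(n, mean, var):
    """Kuder-Richardson formula 21 from test length, mean and variance."""
    return n / (n - 1) * (1 - mean * (n - mean) / (n * var))


def beta_parameters(n, mean, var):
    """Step 3: beta density parameters from KR 21 and the mean."""
    a21 = kr21(n, mean, var)
    if not 0 < a21 < 1:
        raise ValueError("KR 21 must lie strictly between 0 and 1")
    alpha = (-1 + 1 / a21) * mean
    beta = -alpha + n / a21 - n
    return alpha, beta


def huynh_kappa_small_cutoff(n, mean, var, c):
    """Steps 4a to 6a: returns (P0, P00, kappa) for a cut-off score c, 1 <= c <= n."""
    alpha, beta = beta_parameters(n, mean, var)

    # Step 4a: P0 = f(0) + ... + f(c - 1), built up from f(0).
    fx = math.prod((n + beta - i) / (n + alpha + beta - i) for i in range(1, n + 1))
    p0 = 0.0
    for x in range(c):
        p0 += fx
        fx *= (n - x) * (alpha + x) / ((x + 1) * (n + beta - x - 1))

    # Step 5a: f(0, 0), then the first column f(x, 0), which equals f(0, x).
    f00 = math.prod(
        (2 * n + beta - i) / (2 * n + alpha + beta - i) for i in range(1, 2 * n + 1)
    )
    first_column = [f00]
    for x in range(c - 1):
        first_column.append(
            first_column[-1] * (n - x) * (alpha + x) / ((x + 1) * (2 * n + beta - x - 1))
        )
    p00 = 0.0
    for y in range(c):
        fxy = first_column[y]                  # f(0, y) by symmetry
        for x in range(c):
            p00 += fxy
            fxy *= (n - x) * (alpha + x + y) / ((x + 1) * (2 * n + beta - x - y - 1))

    # Step 6a.
    kappa = (p00 - p0 ** 2) / (p0 - p0 ** 2)
    return p0, p00, kappa


# A 10-item test with mean 6 and variance 6 gives alpha = 3 and beta = 2.
print(beta_parameters(10, 6.0, 6.0))
print(huynh_kappa_small_cutoff(10, 6.0, 6.0, c=6))
```

The large cut-off chain would follow the same pattern, starting from f(n) and f(n, n) and stepping downwards.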


As is the case with the traditional reliability index based on the 
relationship between true score and error of measurement, kappa 
increases as a function of test length and test score variability. Further, 
Kappa varies with cutoff score, taking smaller values at both extremes 
of the score range. Hence there is no unique Kappa for a given test, and
it is recommended (Millman, 1974) that its values be reported together
with descriptions of the specific situations under which they are
computed. 


6.2 Subkoviak's Group Co-efficient of Agreement : 

Subkoviak (1976) proposed a single-test-administration
estimate of the coefficient of agreement, giving the 'extent to which the
students would be assigned to the same mastery states'. He calls it the
group coefficient of agreement, $P_c$, which is given by the following
formula:

$$ P_c = \frac{\sum_{i=1}^{N} P_c^{(i)}}{N} $$

where $P_c^{(i)}$ is the coefficient of agreement for person i, c refers
to the criterion or cut off score for mastery and N is the number of students
taking the test. $P_c^{(i)}$ in turn is given by the following:

$$ P_c^{(i)} = [P(X_i \ge c)]^2 + [1 - P(X_i \ge c)]^2 $$

where $[P(X_i \ge c)]^2$ is the probability of consistent
mastery/mastery decisions on two supposed administrations, and $P(X_i \ge c)$
denotes the probability that the score $X_i$ is equal to or above criterion
c. This $P(X_i \ge c)$ is given by the following:

$$ P(X_i \ge c) = \sum_{x=c}^{n} \binom{n}{x} p_i^x (1 - p_i)^{n - x} $$

where n = number of items in the test and $p_i$ is the true
probability of correct item response for person i. The true probability $p_i$
can be estimated from the student's observed score $x_i$ on a single test.
For example, any person's true probability estimate is $\hat{p}_i = x_i / n$.

A better estimate of $p_i$ is given by the regression equation

$$ \hat{p}_i = \alpha_{21} \left( \frac{x_i}{n} \right) + (1 - \alpha_{21}) \left( \frac{M_x}{n} \right) $$

where $\alpha_{21}$ is the Kuder-Richardson Formula 21 reliability
coefficient (which is the squared correlation between observed and true
score under the simple binomial model). KR 21 reliability is given by

$$ \alpha_{21} = \frac{n}{n - 1} \left[ 1 - \frac{M_x (n - M_x)}{n S_x^2} \right] $$

where $M_x$ is the mean and $S_x^2$ is the variance.


Thus the entire process starts from the computation of $\hat{p}_i$ using
n, $M_x$, $S_x^2$ and the person's score $x_i$. Subkoviak (1976) has given a
tabular work sheet having columns for
$x_i$, $\hat{p}_i$, $\hat{P}(X_i \ge c)$, $\hat{P}_c^{(i)}$ and $\hat{P}_c$, where the ^ sign stands for the respective
estimates. Interested workers may consult his paper for further details.


Subkoviak's method of estimating the consistency of mastery 
classifications across parallel - form administrations can provide either 
individual or group information and can be obtained from a single
administration of a test. However, it should be noted that the probability 
estimates obtained by this method are inflated because of the inclusion 
of chance agreement. 
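
The whole chain, from the regression estimate of each person's true probability to the group coefficient, can be sketched as below. The function names and the eight scores are hypothetical, and dividing by N (rather than N - 1) when computing the variance is an assumption of this sketch.

```python
import math


def kr21(n, mean, var):
    """Kuder-Richardson formula 21 from test length, mean and variance."""
    return n / (n - 1) * (1 - mean * (n - mean) / (n * var))


def prob_at_or_above(n, p, c):
    """Binomial probability of scoring c or more on n items."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c, n + 1))


def subkoviak_agreement(scores, n, c):
    """Return (group coefficient, list of individual coefficients).

    scores : observed scores from a single administration
    n      : number of items in the test
    c      : cut-off score for mastery
    """
    count = len(scores)
    mean = sum(scores) / count
    # Assumption: variance computed with divisor N.
    var = sum((x - mean) ** 2 for x in scores) / count
    a21 = kr21(n, mean, var)
    individual = []
    for x in scores:
        # Regression estimate of the person's true probability.
        p_hat = a21 * (x / n) + (1 - a21) * (mean / n)
        p_master = prob_at_or_above(n, p_hat, c)
        # Consistent mastery/mastery plus consistent non-mastery/non-mastery.
        individual.append(p_master ** 2 + (1 - p_master) ** 2)
    return sum(individual) / count, individual


# Eight students on a 10-item test with cut-off score 8 (made-up data).
group, individual = subkoviak_agreement([4, 6, 8, 9, 7, 5, 10, 3], n=10, c=8)
print(group)
print(individual)
```

Each individual coefficient lies between 0.5 and 1, being lowest for students whose estimated probability of reaching the cut-off is close to one half.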


7.0 CONCLUDING REMARKS :

A number of different definitions and indices of reliability for
criterion referenced tests have been proposed in an attempt to cope
with the possible lack of score variability that attenuates traditional
coefficients. Teachers who develop CRTs have to choose first an
appropriate category of reliability and then a specific index within that
category, suitable for their purpose.


The reliability information for CRTs needs to be reported on an
objective-by-objective basis. If a criterion-referenced test measures
more than a single objective, the test items should be arranged into
clusters according to the objectives being measured. Within each of
these clusters of items, domain scores may be estimated and mastery
classifications may be made. Whatever the use of the scores,
appropriate reliability information should be reported on each use of the
scores derived from the test.


Decision Theoretic Approach To
Criterion Referenced Testing


R.K. Mathur 
ABSTRACT 


This paper outlines some appropriate statistical methods 
that may prove of use in making instructional decisions for 
classifying a student as ‘master’ or ‘non-master’ in the 
sequence of his formative evaluations. The discussions in this 
paper have centred on contributions to criterion-referenced 
testing in the area of definitions and terminology, allocation of 
the student to mastery states from a decision-theoretic point of 
view, and estimation of domain score. It has been emphasised 
that these procedures are merely aspects of a more general 
philosophy that teaching and learning should be child-centred 
and instruction and evaluation should be individualised to 
cater to the different learning needs and rate of growth of 
individual learner. The important aspect of a child-centred
approach to learning is keeping alive our optimistic faith in
every learner's capacity for excellent learning, together with an approach to
instruction that is systematic, interactive and learner
oriented, an approach that consistently promotes the student's
cognitive and affective growth.


(I) INTRODUCTION 


With the need for significant changes in our elementary and 
secondary schools clearly documented in the National Policy on 
Education (NPE)-1986 and Programme of Action (POA), we have to 
develop and implement a diverse collection of alternative educational 
programmes that seek to improve the quality of education by 
individualising instruction. It has been noted in NPE-1986 (page 11)
that "A child-centred and activity-based process of learning should be
adopted ... Learners should be allowed to set their own pace and be
given supplementary remedial instruction". In the context of the evaluation
process and examination reforms, the NPE-1986 states (at page 24) that
"evaluation process and assessment of performance should be an
integral part of the process of learning and teaching and should be
employed to bring about qualitative improvement in education". A
common and important characteristic, in functional terms, of the 
recommendations on examination reform and evaluation process is that 
the curriculum should be defined in terms of instructional objectives and 
learning outcomes. It has been noted in the POA (1986) that "the 
Boards of education will lay down the level of attainment and prescribe 
the learning objectives corresponding to these levels of attainment in 
terms of knowledge, comprehension, communication, applicational skills 
and the ability to learn". A programme specified in such a way is 
referred to as criterion-based. The overall goal of a criterion-based 
instructional and testing programme is to provide an educational 
programme which is maximally adaptive to the requirements of the 
individual learner. The instructional objectives specify the curriculum 
and serve as a basis for the development of curriculum materials and 
criterion-based achievement tests. Among the best examples of 
criterion-based programmes are Individually Guided System of 
Instruction (IGSI) (Glaser 1968), Programme for Learning in Accordance 
with Needs, (Flanagan 1969) and A Model for School Learning (Bloom 
1968; Carroll 1963; 1970 and Block 1971).


While not all educators agree on the usefulness of these 
instructional and evaluation models in the schools, the position taken in 
this paper is that these models are useful and that the usefulness of 
these models will be enhanced by developing testing methods and 
decision procedures specifically designed for use particularly in the 
context of formative evaluation of the learner. The purpose of this paper 
is to outline some appropriate statistical methods that may prove of use 
in making instructional decisions for students. 


It appears that much of the discussion on criterion-referenced 
testing stems from different understandings as to the basic purpose of 
testing in these instructional models. It would seem to us that in most 
cases, the pertinent question is whether or not the individual learner has 
attained some prescribed degree of competency on an instructional 
performance task. Questions of comparisons among individuals seem to 
be, by and large, irrelevant in the context of criterion-referenced testing. 
In many of the new instructional models tests are used to determine on 
which instructional objectives a learner has met the acceptable 
performance level standard set by the test designer. This test 
information is usually used immediately to evaluate the student's 
mastery of the instructional objectives covered in the test, so as to 
locate him appropriately for his next sequential instructional unit. Tests
specifically designed for this particular purpose have come to be known
as Criterion-Referenced Tests. Criterion-referenced tests are specifically
designed to meet the measurement needs of the new
individualised-instructional models. In contrast, the better known
norm-referenced tests are principally designed to produce test scores
suitable for ranking learners on the ability measured by the test.


(II) CRITERION-REFERENCED TESTS : DEFINITIONS AND
SELECTED ISSUES

Criterion-Referenced Tests have been defined in a multitude of
ways in the educational literature. See, for example, Glaser and Nitko
(1971); Millman (1974); Harris et al (1974); and Livingston (1972). The
various definitions of criterion-referenced test have been reviewed by 
Millman (1974); Hambleton et al (1978) and by Singh (1982). We shall 
not go into the merits of the various definitions. But it appears to me that 
possibly the least restrictive definition of criterion-referenced testing has 
been proposed by Glaser and Nitko (1971): 

"A criterion-referenced test is one that is deliberately 
constructed so as to yield measurements that are directly 
interpretable in terms of specified performance standards. 
The performance standards are usually specified by defining 
some domain of tasks that the students should perform. 
Representative samples of tasks from this domain are 
organised into a test. Measurements are taken and are used 
to make a statement about the performance of each 
individual related to that domain (page 653)". 

It follows from the Glaser and Nitko definition that the 
construction of a criterion-referenced test requires sampling of items 
from well specified domains of items. If the domain has been well 
specified and the items selected from it by probability sampling, it 
becomes possible to estimate the domain score for an examinee. Only
then does this score have substantial meaning, so that it can be interpreted on an
absolute criterion-referenced scale. A common thread running through 
the various approaches to criterion-referenced tests is that the test 
developer has to give sufficient attention to domain specification and 
problems of item-sampling design. 


If one accepts the Glaser and Nitko definition of a 
criterion-referenced test it is apparent that the test may often be 
multi-dimensional while made up of uni-dimensional sub-scales. That is, 
the items from a criterion-referenced test are organised in distinct and
different sub-scales of homogeneous items measuring common skills. 
An instructional decision for each individual is then often made on the 
basis of his performance on each sub-scale. Major interest may thus 
rest on sub-scale scores rather than the aggregate score.

One of the problems yet to be reckoned with for
criterion-referenced tests is an instance of the band-width fidelity issue
(Cronbach and Gleser, 1965). When the total testing-time is fixed and 
there is interest in measuring many competencies, one may be faced 
with the problem of whether to obtain very precise information about a 
small number of competencies or less precise information about many 
more competencies. The problem of how to fix the length of each 
sub-scale so as to maximize the percentage of correct decisions or 
some similar measure of overall decision-making accuracy on the basis 
of test results has yet to be resolved, or indeed, to be formulated 
satisfactorily. 


(III) DECISION-THEORETIC APPROACH TO
CRITERION-REFERENCED TESTING

We shall assume that a criterion-referenced test is constructed 
by randomly sampling items from a well-defined domain of items 
measuring an instructional objective. (When a criterion-referenced test 
measures more than a single objective, the procedures described here 
may have to be repeated for each objective). Our conceptual
framework for criterion-referenced testing is as follows. We see testing 
as a decision theoretic process. One of the main differences between 
norm-referenced tests and criterion-referenced tests is in terms of the 
kinds of decisions they are specifically designed to make. 
Norm-referenced measurement is particularly useful in situations where 
one is interested in fixed quota selection or ranking of individuals on 
some ability continuum. Criterion-referenced measurement involves 
what Cronbach and Gleser (1965) would call a "Quota free selection 
problem". That is, there is no quota on the number of examinees who 
can exceed the cut-off score or threshold on a criterion-referenced test.
A cut-off score is set for each subscale of a criterion-referenced test to 
separate examinees into two mutually exclusive groups. One group is 
made up of examinees with high enough score (greater than or equal to 
the cut-off score) to infer that they have mastered the material to a 
desired level of proficiency. The second group is made up of examinees 
who did not achieve the minimum proficiency standard. 


The primary problem in criterion-referenced testing models is 
one of determining if T, the student's true mastery-level, is greater than 
a specified standard. Here T is the "true-score" for an examinee in some 
particular well defined content domain. Since we cannot administer all 
possible items in the domain due to constraints of testing-time and 
resources, we sample some small number of items to obtain an estimate
of T, represented as $\hat{T}$. The value of S is a somewhat arbitrary threshold
score used to divide individuals into the two categories, i.e. ‘masters’ and
‘non-masters’. The obtained scores, however, may differ from the ‘true
score’ (or the domain score) due to sampling of items from the universe
or the content domain.


Basically then, the examiner's problem is to locate each 
examinee in the correct category. There are two kinds of errors that
occur in this classification problem : false positive and false negative. A
false positive error occurs when the examiner estimates an examinee's
ability to be above the cutting score when, in fact, it is not. A false
negative error occurs when the examiner estimates an examinee's
ability to be below the cutting score when the reverse is true. The
seriousness of making a false positive error depends to some extent on 
the structure of the instructional objectives. It would seem that this kind 
of error has the most serious effect on programme efficiency when the 
instructional objectives are hierarchical in nature. On the other hand the 
seriousness of making a false negative error would seem to depend on 
the length of time a student could be assigned to a remedial programme 
because of his low test performance. The minimisation of expected loss
would then depend in the usual way on the specific losses and the
probabilities of incorrect classification. This is then a statistical exercise
in the minimisation of what we call the expected loss. In the section
below we give a mathematical decision-theoretic model for location of
examinees to mastery states.


In the next section we give the results on estimation of 
examinee's domain-score or true score for deriving the rules for location 
of examinees to mastery states or for estimation of examinees' domain 
score. For this, we adopt a simple macro model of obtained scores
named as Gaussian error model in Lord and Novick (1968). 


(IV) CLASSIFICATION INTO ONE OF THE TWO GROUPS : 
MASTERS & NON-MASTERS 
(a) Assumptions 

We assume that the obtained score X of a student (examinee) 
is the algebraic sum of two components, T and E. The component T 
represents the domain (or true) score of a student, a quantity relatively 
stable as long as the itms are sampled from the same universe or 
domain. The other component E is the random error of measurement, 
arising mainly due to sampling of items from the specified criterion 
domain. E is assumed to be normally distributed independently of T, 
with zero mean and a constant experimentally determinable variance 

 . Under the above assumptions 


() E(X) = Е(Т) = pw (Say) 
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and the variance с2 of the obtained scores X is 
(2) ok = oT4 0$ 


The conditional probability distribution of observed score of 
astudent whose domain-score or true-ability is T is given by 


(3) (Х/Т) = МТ, o9) 


Where N (T, с) denotes a normal distribution with mean Т and 
variance 03). £ 


Let the probability distribution of T be given by 
(4) =No 


The correlation between X and T is given by 


= OE LUE CTE- өл 
(5) prc Gum us: 


p is known as the index of reliability of X and р? which gives the 
proportion of the variance in test scores which is due to ‘true differences’ 
between individuals is known as the reliability coefficient of X. 


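(6) The joint probability distribution of T and X is given by 

$f(T, X) = \dfrac{1}{2\pi\sigma_T\sigma_X\sqrt{1-\rho^2}} \exp\left\{-\dfrac{1}{2(1-\rho^2)}\left[\left(\dfrac{T-\mu}{\sigma_T}\right)^2 - 2\rho\left(\dfrac{T-\mu}{\sigma_T}\right)\left(\dfrac{X-\mu}{\sigma_X}\right) + \left(\dfrac{X-\mu}{\sigma_X}\right)^2\right]\right\}$ 

Writing $(T-\mu)/\sigma_T = t$ and $(X-\mu)/\sigma_X = x$, the joint 
probability distribution of x and t is given by 

(7) $\phi(x, t) = \dfrac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{-\dfrac{x^2 - 2\rho x t + t^2}{2(1-\rho^2)}\right\}$ 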


Let $\xi$ denote the cut-off score separating 'non-masters' from 
'masters'. If a student's domain score (or true score) is less than $\xi$, his 
true group is 'non-master'; if his domain score is greater than or equal to 
$\xi$, his true group is 'master'. 


(b) OPTIMAL OPERATIONAL PASS-MARK 

Let $\eta$ be the operational cut-off point (on the observed score 
continuum) for classifying a student as 'master' or 'non-master'. If a 
student's obtained score X is below $\eta$, he is classified as 'non-master'; 
if his obtained score X is $\eta$ or above, he is classified as 'master'. 

We shall call $\eta$ the operational pass-mark. In following an 
operational pass-mark $\eta$, the evaluator can make two kinds of errors. If the 
student is from the true class 'master', the evaluator can classify him as 
'non-master' (false negative type of error); if the student is from the true 
class 'non-master', the evaluator can classify him as 'master' (false positive 
type of error) on the basis of his observed score X. 


Let C1 represent the cost of misclassifying a student as 
'non-master' when his true class is ‘master’ and C2 the cost of 
misclassifying a student as ‘master’ when his true class is 'non-master'. 


Cost of Correct and Incorrect Classification 

                          Assigned Group 
True Group          Non-master        Master 

Non-master              0               C2 

Master                  C1              0 

These costs may be measured in any kind of unit. It is only the 
ratio of the two costs that is important. In practice, the costs are often 
taken as equal. The expected cost of misclassification is given by 

(8) $M = C_1\alpha + C_2\beta$ 

where $\alpha$ and $\beta$ represent the probabilities of the joint events 
$\{X < \eta,\ T \ge \xi\}$ and $\{X \ge \eta,\ T < \xi\}$, respectively. Under the 
assumptions stated above, $\alpha$ and $\beta$ are given by 

(9) $\alpha = \Phi(x_0) - \psi(x_1, x_0, \rho)$ and 

(10) $\beta = \Phi(x_1) - \psi(x_1, x_0, \rho)$ 

where $x_0 = (\xi - \mu)/\sigma_T$, $x_1 = (\eta - \mu)/\sigma_X$, $\rho$ is the index of 
reliability of the observed scores, and $\Phi(h)$ and $\psi(h, k, \rho)$ are 
defined in (11) and (12), respectively. 

(11) $\Phi(h) = \displaystyle\int_h^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$ 

(12) $\psi(h, k, \rho) = \displaystyle\int_h^{\infty}\!\!\int_k^{\infty} \phi(x, t)\, dt\, dx$ 

where $\phi(x, t)$ has been defined in (7). 

Let $C_1/C_2 = K$ represent the relative undesirability of the two types 
of misclassification. Substituting $\alpha$ and $\beta$ in (8) by their values given by 
(9) and (10), respectively, and noting that $C_1 = KC_2$, M reduces to 

(13) $M = C_2\left[K\,\Phi(x_0) + \Phi(x_1) - (K + 1)\,\psi(x_1, x_0, \rho)\right]$ 

For given values of K, $\rho$ and $x_0$, we can find the value of $x_1$ 
which minimises M. 


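Minimisation of M leads to the optimal operational pass-mark $\eta$ 
given by 

(14) $\eta = \mu + \dfrac{\xi - \mu}{\rho^2} - \theta\,\dfrac{\sqrt{1 - \rho^2}}{\rho}\,\sigma_X$ 

where $\theta$ is given by 

(15) $\theta$ = [a function of $\log_e K$; the expression is not legible in the source] for $K \neq 1$ 

     $\theta = 0$ for $K = 1$ 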


It may be noted that when $\rho = 1$, that is, the scores are perfectly 
reliable, the optimal operational pass-mark is equal to $\xi$, which is the 
prescribed cut-off score on the domain-score continuum separating 
'masters' from 'non-masters'. 


It may also be seen that when the ratio of the undesirabilities of 
the two types of misclassification is unity, the optimal operational 
pass-mark simplifies to 

(16) $\eta = \mu + (\xi - \mu)/\rho^2$ 

It may be noted that $\rho < 1$. Therefore, if $\xi$ is greater than $\mu$, as 
is usually the case, and if K is equal to 1, the operational pass-mark $\eta$ 
will be greater than the cut-off score $\xi$. 


Thus, we have the following important result. When the ratio of 
the two costs of misclassification is unity and the cut-off score $\xi$ is 
greater than the average of the observed scores X ($\xi$ is normally kept 
between 80% and 90%), then the optimal operational pass-mark $\eta$ is 
always greater than the prescribed cut-off point $\xi$ on the domain-score 
continuum, separating 'masters' from 'non-masters'. 
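
The result above can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' own computation: it takes the bivariate normal model of this section, evaluates the expected loss (in units of C2) by simple numerical integration, and compares it with a closed-form pass-mark. The closed form uses the exact normal quantile of K/(1 + K), which is what setting the derivative of the expected loss to zero gives under this model; it is an assumption of the sketch and may differ in form from the expression for theta printed in (15). All parameter values and function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def optimal_pass_mark(mu, sigma_x, rho, xi, k=1.0):
    """Pass-mark minimising C1*alpha + C2*beta, with k = C1/C2.

    Assumption of this sketch: the quantile z = Phi^{-1}(k / (1 + k))
    plays the role of theta; it is zero when k == 1.
    """
    z = STD.inv_cdf(k / (1.0 + k))
    return mu + (xi - mu) / rho**2 - z * math.sqrt(1.0 - rho**2) / rho * sigma_x


def expected_loss(eta, mu, sigma_x, rho, xi, k=1.0, steps=4000):
    """Expected loss in units of C2, by midpoint integration over x."""
    sigma_t = rho * sigma_x                 # from rho = sigma_T / sigma_X
    t0 = (xi - mu) / sigma_t                # standardised domain cut-off
    x1 = (eta - mu) / sigma_x               # standardised pass-mark
    spread = math.sqrt(1.0 - rho**2)        # sd of t given x
    low, high = -8.0, 8.0
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * width
        p_nonmaster = STD.cdf((t0 - rho * x) / spread)   # P(t < t0 | x)
        if x < x1:
            cost = k * (1.0 - p_nonmaster)  # failed although a master
        else:
            cost = p_nonmaster              # passed although a non-master
        total += cost * STD.pdf(x) * width
    return total


# Hypothetical test: mean 60, sd 10, index of reliability 0.9, cut-off 80.
params = dict(mu=60.0, sigma_x=10.0, rho=0.9, xi=80.0)
eta = optimal_pass_mark(**params)
print(f"optimal pass-mark for K = 1: {eta:.2f}")
print(f"loss at optimum: {expected_loss(eta, **params):.5f}")
print(f"loss at the domain cut-off: {expected_loss(80.0, **params):.5f}")
```

With equal costs the computed pass-mark lies above the domain cut-off of 80, in line with the result stated above, and raising K (making false negatives costlier) lowers it.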


We have in the above section worked out the optimal 
operational pass-mark from a decision-theoretic point of view, assuming 
the Gaussian error model. 


(V) ESTIMATION OF EXAMINEE DOMAIN SCORE 

There are several methods available for the estimation of the 
domain score for an examinee. The basic problem is, given an 
examinee's obtained score X on a criterion-referenced test, to estimate 
the examinee's 'true' (or domain) score, that is, the score he would have 
obtained had he been administered all the items in the domain of items 
measuring the objectives covered by the test. 


One of the earliest attempts to produce an estimate of the true 
score of an examinee was made by Kelly in 1927 (Lord and Novick, 
1968, page 65). 


Mathur (1966, 1968) derived the true-score estimator assuming 
that the observed score distribution is moderately non-normal and 
represented by the first four terms of Edgeworth's form of the type A 
expansion. The author has shown that the regression estimate of the true 
score of an examinee whose observed score is X is given by 

(17) $\hat{T} = \mu + \rho^2 (X - \mu) + (1 - \rho^2)\,\sigma_X\,\dfrac{\phi(x)}{f(x)}\left[\dfrac{\lambda_3}{2}H_2(x) + \dfrac{\lambda_4}{6}H_3(x) + \dfrac{\lambda_3^2}{12}H_5(x)\right]$ 

where $x = \dfrac{X - \mu}{\sigma_X}$, $\phi(x) = \dfrac{1}{\sqrt{2\pi}} \exp\left(-\dfrac{x^2}{2}\right)$, 
$\lambda_3$ and $\lambda_4$ represent the third and fourth cumulants of x, 
and $H_r(x)$ denotes the Hermite polynomial of the rth degree; and 

(18) $f(x) = \phi(x)\left[1 + \dfrac{\lambda_3}{6}H_3(x) + \dfrac{\lambda_4}{24}H_4(x) + \dfrac{\lambda_3^2}{72}H_6(x)\right]$ 

It may be noted that if both $\lambda_3$ and $\lambda_4$ = 0 then $f(x) = \phi(x)$ and 
(17) reduces to 

$\hat{T} = \mu + \rho^2 (X - \mu)$ 

which is the well-known Kelly's estimator. 

It may be noted that if $\lambda_3$ and $\lambda_4$ are both non-zero, the density 
function of X is non-normal, and in that case the true-score estimator is 
non-linear in X as long as $\rho^2 \neq 1$. 

Result (17) has been derived by working out the regression of 
true score on observed score under the assumptions that the true score 
T is estimated by X, where X = T + E; that E is normally distributed 
independently of T with zero mean and constant experimentally 
determinable variance $\sigma_E^2$; and that the distribution of 
$x = (X - \mu)/\sigma_X$ is given by the first four terms of Edgeworth's 
form of the type A expansion. 
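
The two estimators discussed above can be compared numerically. The sketch below is illustrative only. It relies on an identity that holds when E is normal and independent of T, namely that the regression of T on X equals X plus the error variance times the derivative of the log density of X; applying this to the four-term Edgeworth density is an assumption of the sketch and is not claimed to reproduce the printed arrangement of Mathur's formula. The function names, the cumulant values and the score values are hypothetical.

```python
def kelly_estimate(x_obs, mu, rho_sq):
    """Linear regression estimate of the true score (normal case)."""
    return mu + rho_sq * (x_obs - mu)


def edgeworth_estimate(x_obs, mu, sigma_x, rho_sq, lam3=0.0, lam4=0.0):
    """Regression estimate when the standardised observed score follows
    the four-term Edgeworth type A density (sketch; see lead-in)."""
    x = (x_obs - mu) / sigma_x
    # Hermite polynomials (probabilists' form) of degree 1 and 3 to 7.
    h1 = x
    h3 = x**3 - 3 * x
    h4 = x**4 - 6 * x**2 + 3
    h5 = x**5 - 10 * x**3 + 15 * x
    h6 = x**6 - 15 * x**4 + 45 * x**2 - 15
    h7 = x**7 - 21 * x**5 + 105 * x**3 - 105 * x
    a, b, c = lam3 / 6.0, lam4 / 24.0, lam3**2 / 72.0
    density_ratio = 1.0 + a * h3 + b * h4 + c * h6   # f(x) / phi(x)
    slope_ratio = -(h1 + a * h4 + b * h5 + c * h7)   # f'(x) / phi(x)
    # The Edgeworth density can turn negative far in the tails; the
    # estimate is only meaningful where density_ratio is positive.
    return x_obs + sigma_x * (1.0 - rho_sq) * slope_ratio / density_ratio


# Hypothetical test: mean 60, sd 10, reliability coefficient 0.81.
for score in (45.0, 60.0, 75.0):
    linear = kelly_estimate(score, 60.0, 0.81)
    skewed = edgeworth_estimate(score, 60.0, 10.0, 0.81, lam3=0.5, lam4=0.2)
    print(f"X = {score:5.1f}  linear = {linear:6.2f}  Edgeworth = {skewed:6.2f}")
```

When both cumulants are set to zero the second function returns the same value as the first, which mirrors the reduction to the linear estimator noted in the text.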


It may also be noted that Jackson (1972) used Kelly's estimator 
with binary data by transforming the scores by the arcsine 
transformation. 


Jackson's estimator is not ideal since it does not take into 
account any prior information that may be available. Novick et al. (1973) 
and Swaminathan et al. (1975) used a Bayesian decision-theoretic 
approach for estimation of an examinee's domain score. The Bayesian 
solutions given by Novick et al. (1973) or Swaminathan et al. (1975) are 
more complicated than the results derived here under the simple 
macro-model. We shall not discuss here the Bayesian decision-theoretic 
procedures for estimation of an examinee's domain score or for allocation 
of examinees to mastery states. 


(VI) SUMMARY AND CONCLUSION 

The successful implementation of criterion-referenced 
instructions and testing programmes depends, in part, upon the 
availability of appropriate procedures for developing and utilizing 
criterion-referenced tests for monitoring student progress. The 
discussion in this paper has centred on contributions to 
criterion-referenced testing in the areas of definitions and terminology, 
allocation of the examinees to mastery-states from a decision-theoretic 
point of view, and estimation of ‘domain-score’ as regression-estimator. 


There can be no doubt that, if achievement testing (and 
educational assessment generally) is to advance significantly, it will do 
so only through coming to grips with the problems of domain 
specification, item banking and item sampling. These procedures are 
merely agents of a general philosophy that teaching and learning should 
be child-centred, and instruction and evaluation should, in course of 
time, be individualised to cater to the different learning needs and rates of 
growth of each individual learner. The method of learning recommended 
in the Rigveda is quoted below: 


acharyat padam a dhatte sisyah padam svamedhaya 
padam sa brahmacharibhyo padam kalakramena tu 

The above verse means that a student should learn one-fourth 
from the teacher, one-fourth from self-study, one-fourth from interaction 
with colleagues and one-fourth during application from time to time. 


A Review of Research on 
Criterion-Referenced 
Measurement 


Chandrakant Bhogayata 
ABSTRACT 


In the present paper, a review of research on CRM is 
presented. This review includes six areas of research on CRM, 
namely (1) item-generation techniques, (2) item analysis, (3) 
determination of cut-off scores or setting of standards on CRTs, 
(4) reliability of CRTs, (5) application of Item Response Theory 
(IRT) to CRM, and (6) development and validation of CRTs. The 
present review is based on a relatively small and purposive 
sample of studies on CRM, hence the integration or meta-analysis 
of the results of the studies is not attempted. Trends in the 
different areas of research on CRM are described in this review. 

An interpretation of achievement test scores with reference to 
well-defined behavioural or content domains is called Criterion-Referenced 
Measurement (CRM). CRM can be used to monitor individual progress in 
objectives-based instructional programs, to diagnose learning deficiencies, 
to evaluate educational and social action programs, and to assess 
competencies on various certification and licensing examinations. 


CRM is still an educational innovation in India, but millions of 
students routinely take Criterion-Referenced Tests (CRTs) at all levels 
of education in developed countries like the U.S.A. There has been 
considerable research on CRM. There are some reasons for the recent 
proliferation of research on CRM: (1) the CRM movement is relatively newer 
than Norm-Referenced Measurement (NRM); (2) there are many unresolved 
issues in CRM; and (3) there have been several developments in 
measurement theory (e.g. item response theory and graph theory) parallel 
to the developments in CRM. 


1. ITEM-GENERATION TECHNIQUES 

Every test item is in some way like a little test. Therefore 
item generation is an important starting point for the development of a CRT. 
The literature on test-item writing has usually taken the form of advice on 
how to write items. But traditional test-item writing has many limitations 
(Bejar, 1983, p. 11). A technology for test-item writing has now 
emerged (Roid & Haladyna, 1982, p. 6). This technology is based on 
several theories of content specification. 


Roid and Haladyna (1978) contrasted two techniques for writing 
test items: (a) writing items from statements of instructional objectives; and 
(b) writing items from semi-automated rules for transforming instructional 
statements (adapted from Bormuth). Items of each type were written by two 
experienced item writers. Students were given tests employing these items 
before and after reading a programmed booklet. One item writer was found to 
produce consistently easier items than the other, regardless of the 
item-writing technique employed. Both item-writing techniques resulted in 
about the same number of faulty items, indicating that the "subjectivity" 
found in traditional item writing was also present with the semi-automated 
techniques. 


Berk (1980) compared six content domain specification strategies 
for CRM. He used a rating system with criteria to present a profile of the 
advantages and disadvantages of each strategy. His comparison revealed 
that the extent to which the six strategies achieve an unambiguous domain 
definition and overcome the stated deficiencies of objectives varies 
considerably. Amplified objectives, IOX test specifications, and mapping 
sentences fall short of these goals. Item transformations, item forms, and 
algorithms attain unambiguous definition by means of sophisticated rule 
structures. The profile of the strategies suggested that the rigor and 
precision of the specifications are inversely related to their practicability. 
Berk concluded that a great deal of research needs to be conducted to 
appraise the validity of these 'alternative' and new strategies for 
item generation. 


Macready and Merwin (1973) studied an item-generation technique 
developed by Hively and his co-workers. This technique is called the item form. 
They studied the homogeneity of items within item forms. The results of the 
study suggested that in most cases item forms which generate items of 
moderate difficulty can be used to obtain relatively homogeneous sets of 
items of equivalent difficulty for a defined population of subjects. Such item 
forms provide sets of items difficulties alone were used to group items into 
sets. The results of the study also suggested a basis for identification of 
item forms which will generate homogeneous items of similar difficulty. 
Using this information it is possible to determine whether the breadth of an 
item form is appropriate and if not, identify changes which will lead to an 
item form of more useful breadth. 


2. ITEM ANALYSIS 


Item analysis is an acid test for the items of any test. Traditional item 
analysis using item parameters such as difficulty and discrimination 
values is based on classical test theory, and it is more appropriate for NRM 
than CRM (Popham and Huseck, 1972, p. 139). Item analysis for CRTs has 
become one of the most controversial steps of test development. 
Researchers in the field of CRM have shown keen interest in this area. 
Most of the research effort in this area is invested in the search for an 
appropriate item statistic for CRM. 


Although many have rejected classical test construction and 
analysis procedures for CRTs, a study by Haladyna (1974) was concerned 
with the possibility that classical procedures are both applicable and 
appropriate when samples of both mastery and nonmastery examinees are 
employed. Empirical evidence from this study supported the practice of 
combining samples (i.e. mastery/nonmastery examinees) to increase the 
variance of test scores and thereby permit the proper estimation of reliability 
and item validities. Thus the combined-samples point-biserial discrimination 
index appeared to be the most efficient method for obtaining information 
about the adequacy of CRT items. 


Crehan (1974) compared three CRT item selection techniques on 
resultant CR reliability and validity: (a) a traditional point-biserial selection, 
(b) teacher selection, and (c) random selection. This study was conducted 
with teacher-made mastery tests, a special variety of CRTs. There was little 
evidence that the item selection method affected resultant test reliability and 
validity. Consistent evidence favoured the modified Brennan and 
Cox-Vargas selection methods on resultant test validity. Generalization of 
the results of this study was limited since the observations were non-random 
and were obtained in traditional instructional situations. 


According to many experts in the field of CRM, a coefficient based on 
pretest and posttest performance is the most appropriate item statistic 
for CRM. This relatively new item parameter is called instructional 
sensitivity. Researchers have suggested several different pretest-posttest 
coefficients. 


Herbig (1976) empirically compared six pre-to-posttest coefficients. 
The six coefficients differed in their usability in small samples, in their 
computability, in having limits for the range, and in their interpretation. The 
use of the Cox-Vargas and the two Herbig indices seemed to have some 
advantages. 
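
As an illustration of a pre-to-posttest coefficient, the sketch below computes the Cox-Vargas index in its usual form, the proportion of examinees answering an item correctly after instruction minus the proportion answering it correctly before instruction. The function name and the response data are hypothetical.

```python
def pre_post_difference(pre_responses, post_responses):
    """Cox-Vargas style index for one item.

    Each argument is a list of 0/1 item scores (1 = correct) from the
    pretest and the posttest respectively. The index ranges from -1 to 1;
    values near 1 suggest an item that is sensitive to instruction.
    """
    p_pre = sum(pre_responses) / len(pre_responses)
    p_post = sum(post_responses) / len(post_responses)
    return p_post - p_pre


# Hypothetical responses of ten pupils to one item.
pre = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
post = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(f"pre-to-post difference index: {pre_post_difference(pre, post):.2f}")
```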


In a study, Hanson, McMorris and Bailey (1986) presented an 
operational framework and a set of empirical procedures to guide item 
selection for CRTs. They found that even when items were closely matched 
to instructional content specifications, important differences in instructional 
sensitivity emerged. These differences were found between the same items 
presented in different formats as well as between different items presented 
within the same format. The results provided guidance as to the specific 
combination and number of items that should be included in the test for 
each skill. 


Van der Linden (1981) gave a latent trait or item response theoretic 
look at pretest-posttest validation of CRT items. Although the method is 
generally considered as the prototype of CR item analysis, van der Linden 
has discussed some serious disadvantages of it, such as : (1) it leads to 
indices based on a dual test administration; (2) it has population-dependent 
item p values; (3) it provides the global information about the discrimination 
power; (4) it supposes an implicit weighting; and (5) it leads to a 
meaningless maximization of posttest scores. van der Linden conducted an 
empirical study to compare the differences in item selection between the two 
methods of item analysis : (a) pretest-posttest validation, and (b) item 
response theoretic analysis. He proposed to replace pretest-posttest indices 
like Cox and Vargas' Dpp by an evaluation of the item information function 
for the mastery score. 


Hambleton and de Gruijter (1983) applied Item Response 
Theory (IRT) models to CRT item selection. Their study showed clearly the 
theoretical advantages of optimal item selection based on IRT models over 
one of the more common alternative strategies, random item selection. 
Shorter tests can be used to achieve acceptable levels of misclassification 
when optimal items are selected. For this study, it was assumed that the 
CRTs would be used to make mastery/nonmastery decisions. The item 
selection method based on IRT models is not suitable for CRTs that are 
intended to provide descriptive information in the form of unbiased domain 
score estimates. 
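
Item selection of this kind can be sketched as follows. The example assumes the three-parameter logistic model with the conventional scaling constant 1.7 and ranks the items of a bank by their information at the cut-off ability. It is a minimal illustration, not the procedure of any of the studies cited; the item bank, the cut-off value and the function names are hypothetical.

```python
import math


def p_3pl(theta, a, b, c, d=1.7):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def item_information(theta, a, b, c, d=1.7):
    """Item information function of the 3PL model at ability theta."""
    p = p_3pl(theta, a, b, c, d)
    return (d * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_items(bank, theta_cut, n_items):
    """Return ids of the n_items most informative items at theta_cut."""
    ranked = sorted(
        bank,
        key=lambda item: item_information(theta_cut, item["a"], item["b"], item["c"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:n_items]]


# Hypothetical item bank; a = discrimination, b = difficulty, c = guessing.
bank = [
    {"id": "A", "a": 1.0, "b": 0.0, "c": 0.2},
    {"id": "B", "a": 1.5, "b": 1.0, "c": 0.2},
    {"id": "C", "a": 0.6, "b": 1.0, "c": 0.2},
    {"id": "D", "a": 1.2, "b": -2.0, "c": 0.2},
]
print(select_items(bank, theta_cut=1.0, n_items=2))
```

Items that discriminate well and whose difficulty lies close to the cut-off carry the most information there, which is why a test built from them can be shorter than one built from randomly drawn items.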


Harris and Subkoviak (1986) suggested a short-cut statistic for 
selecting items for mastery tests. They examined three statistical methods 
for selecting items: (a) the pretest-posttest method due to Cox and Vargas; 
(b) a latent trait or IRT method; and (c) the agreement statistic. A number of 
distinct data sets were simulated, and the three item selection methods were 
applied to each data set for the purpose of studying relationships among the 
methods. The correlation between the IRT method and the agreement 
statistic proposed therein was substantial, suggesting that the latter might 
be recommended as a practical alternative to the former. The results for the 
pretest-posttest method tended to confirm its well-known limitations. 


From the above-mentioned studies on CRM item analysis, the 
following conclusions can be drawn : 


1. Four types of item statistics have emerged for CRM 
item analysis: (a) difficulty and discrimination values based on 
classical test theory; (b) pretest-posttest coefficients; (c) item 
parameters based on IRT; and (d) the agreement statistic of Harris 
and Subkoviak (1986). 

2. Each type of item statistic has its own limitations. Difficulty and 
discrimination values based on classical test theory and 
pretest-posttest coefficients are psychometrically weak statistics. The 
error component of the measurement based on these statistics 
remains largely unexplained. Item statistics based on IRT models 
are very complex. Agreement statistics can be used only up to the 
classroom level and they require further research. 

3. Seemingly trivial variables such as an item format have significant 
effect on the item parameters. 

In short, CRM item analysis has still to get its own Aristotle! 


3. DETERMINATION OF CUT-OFF SCORES OR SETTING 
STANDARDS 

One of the primary purposes of CRM is to make decisions about 
individuals. This requires a standard or cut-off score on the test score scale 
to separate examinees into two categories, often labelled 'masters' and 
'nonmasters'. However, it should be clear that the determination of a cut-off 
score is not a critical attribute of CRM. 


It is essential to stress that all standard-setting methods involve 
judgement and are arbitrary. The process of setting performance standards 
is open to constant criticism and remains controversial to discuss, difficult to 
execute and almost impossible to defend (Berk, 1986). 


Berk (1976) proposed an empirical methodology for the 
determination of optimal cutting scores for short, fixed-length CRTs. The 
optimal cutting score was selected according to the estimated probabilities 
of correct and incorrect mastery/nonmastery decisions using validation 
samples of instructed-uninstructed students. He has discussed several 
problems with the use of this methodology and suggested that further 
research should investigate the effects of using different item formats, test 
lengths and validation samples. 
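
The general idea of choosing a cutting score from instructed and uninstructed validation samples can be sketched as follows. This is a simplified illustration, not a reproduction of Berk's procedure: it assumes that instructed students are the true masters and uninstructed students the true nonmasters, and it picks the cutting score that gives the highest proportion of correct decisions. The function name and the score data are hypothetical.

```python
def best_cut_score(instructed, uninstructed, max_score):
    """Return (cut, proportion of correct decisions).

    A student is classified as a master when the score is at or above
    the cut. Ties are resolved in favour of the lowest cutting score.
    """
    total = len(instructed) + len(uninstructed)
    best_cut, best_rate = 0, -1.0
    for cut in range(max_score + 1):
        correct_masters = sum(score >= cut for score in instructed)
        correct_nonmasters = sum(score < cut for score in uninstructed)
        rate = (correct_masters + correct_nonmasters) / total
        if rate > best_rate:
            best_cut, best_rate = cut, rate
    return best_cut, best_rate


# Hypothetical scores on a ten-item test.
instructed = [8, 9, 7, 10, 6, 9, 8, 7]
uninstructed = [3, 4, 2, 5, 7, 4, 3, 5]
cut, rate = best_cut_score(instructed, uninstructed, max_score=10)
print(f"cutting score = {cut}, correct decisions = {rate:.4f}")
```

The same loop could weight the two kinds of incorrect decision differently if one error were judged more serious than the other.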


Van der Linden (1986) has applied a latent trait or IRT method to 
determining intrajudge inconsistency in the Angoff and Nedelsky techniques of 
standard setting. 


Recently, Berk (1986) has described and evaluated the salient 
characteristics of 23 continuum standard-setting methods in the form of a 
"consumer's guide". A trilevel classification scheme was used to categorize 
the methods, and 10 criteria of technical adequacy and practicability were 
proposed to evaluate them. The first, most general level of classification 
partitioned the methods into two major categories based on their 
assumptions about the acquisition of the underlying trait or ability: state and 
continuum. The second level classified the methods according to whether 
they are based entirely on judgement (judgemental), primarily on judgement 
(judgemental-empirical), or primarily on test data (empirical-judgemental). 
The third level distinguished between empirical-judgemental procedures that 
are intended to set standards and those that are designed to adjust 
standards. According to the results of Berk's evaluative review, among the 
judgemental methods, the Angoff method appeared to offer the best balance 
between technical adequacy and practicability. In the category of 
judgemental-empirical methods, the informed judgement method received the 
highest rating. Among the five methods in the empirical-judgemental 
category, the contrasting groups method received the highest rating overall. It 
was also given the highest rating for technical adequacy out of the 23 
continuum methods. Berk's evaluation system suggested that as far as the 
use of CRM as a part of systematic instruction is concerned, a systematic 
standard-setting method such as contrasting or criterion groups should be 
implemented. It can suggest a cut-off score and also provide decision 
validity evidence. 


4. RELIABILITY 


The assumption that reliability indices, based on internal 
consistency, are not particularly relevant to CRM can be traced, at least in 
part, to a seminal article by Popham and Huseck (1972). They argued that 
since CRTs are designed to determine a person's achievement compared to 
some criterion, the meaning of the score should not depend on the scores of 
other people. Therefore Popham and Huseck concluded that "variability is 
not a necessary condition for a good CRT" (p. 135) and that reliability 
indices based on score variability "are not only irrelevant to CR uses, but are 
actually injurious to their proper development and use" (p. 135). 


Swaminathan, Hambleton and Algina (1974) proposed that the 
reliability of CRT scores be defined in terms of the consistency of the 
decision-making process across repeated administrations of the test. 
Specifically, reliability was defined as a measure of agreement, over and 
above that which can be expected by chance, between the decisions made 
about examinee mastery states in repeated test administrations for each 
objective measured by the CRT. They formulated a decision-theoretic 
method for CRT reliability by applying the coefficient kappa introduced by Cohen. 
They recommended that information such as cutting scores and student 
ability as measured by the test be reported along with the reliability index. 
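
A decision-consistency index of this kind can be illustrated with Cohen's kappa computed from the mastery decisions made on two administrations of a test. The sketch below is a minimal illustration; the function name and the decision data are hypothetical.

```python
def decision_kappa(first, second):
    """Cohen's kappa for two sets of mastery decisions.

    Each argument is a list of 0/1 decisions (1 = master) for the same
    examinees on two administrations. Kappa is undefined when chance
    agreement equals 1, that is, when everyone receives the same
    decision on both occasions.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    p_first = sum(first) / n      # proportion of masters, first occasion
    p_second = sum(second) / n    # proportion of masters, second occasion
    chance = p_first * p_second + (1 - p_first) * (1 - p_second)
    return (observed - chance) / (1 - chance)


# Hypothetical decisions for ten examinees on two administrations.
first = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
second = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(f"kappa = {decision_kappa(first, second):.3f}")
```

Here the raw agreement is 0.80, but about half of that is expected by chance alone, so kappa is considerably lower.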


When reliability is defined as the ratio of true variance to observed 
variance (Lord and Novick, cited in Lovett, 1977), it can be seen that the 
reliability coefficient is a measure of the amount of observed variance 
attributable to deviations in individual performance from some point C on the 
score scale. In the case of a CRT, C is generally a minimum level of 
acceptable performance, or cutting score. Lovett (1977) developed and tried 
out a generalised ANOVA procedure for use in estimating the reliability of 
CRTs. He extended the definition of CRM reliability to the mean of a number 
of parallel measures. A typical test situation was described as a 
randomized complete block design. Expected values for the mean squares, 
error and person, were derived and shown to be equal to observed and true 
variance for CRTs. CRT reliability was then redefined in terms of the expected 
values of the error and person mean squares. 


In practice, most CRTs are designed to measure performance on 
several learning objectives. For such tests, it is also common to have a 
different number of items for each objective and therefore a separate cutting 
score for determining mastery on each objective. Raju (1982) generalized 
Rajaratnam, Cronbach and Gleser's generalizability formula for 
stratified-parallel tests and Raju's coefficient beta to estimate the reliability 
of a composite of CRTs, where the parts of the composite have different 
cutting scores. The new formulas are especially useful, for example, in 
estimating the reliability of a CRT with several objectives, where each 
objective has a different cutting score. 


Wilcox (1983) described and compared seven procedures for 
estimating the reliability of a CRT. The procedures were based on a single 
administration of a CRT scored with a latent structure model. Results 
suggested that the predictive estimate is the most accurate of the 
procedures. 


If the reliability of a CRT is very low, differences in observed scores 
can be attributed to errors of measurement rather than to differences in 
individuals' levels of mastery of the domain. The analyses by Kane (1986) 
suggested that if the reliability (defined in terms of internal consistency) is 
much below 0.5, the test will not provide more accurate estimates of 
universe scores defined on a domain of items than would a simple a priori 
procedure based on group performance. Thus, Kane demonstrated the role 
of ‘classical’ reliability in estimating universe scores on the domains of 
items. He also suggested three solutions for a CRT with low reliability : (a) 
lengthening the test; (b) defining the domain and item generation 
procedures more carefully, and (c) estimating the mean universe score for 
the group. It should be clear that the analyses presented by Kane do not 
apply to decision accuracy when a cut-off score is used to place students in 
mastery categories. 


The studies on the reliability of CRTs covered in the present review have 
addressed the following dimensions: 

1. Search for a prototype reliability index for CRTs with cut-off scores 
(Swaminathan et al., 1974). 

2. Use of ANOVA in estimating the reliability of CRTs (Lovett, 1977). 

3. A procedure for estimating the reliability of a composite of CRTs, 
where the parts of the composite have different cutting scores 
(Raju, 1982). 

4. Use of a single administration for estimating the reliability of CRTs 
(Wilcox, 1983). 

5. The role of the traditional reliability index in estimating universe or 
domain scores on CRTs (Kane, 1986). 


5. APPLICATION OF ITEM RESPONSE THEORY TO CRM 

Generally, researchers have applied four measurement theories or 
models to CRM : (a) classical test theory, (b) generalizability theory, (c) Item 
Response Theory (IRT), and (d) graph theory or order theory. 


Item Response Theory (IRT) is a modern measurement theory. It 
has several advantages over other measurement theories. It can truly and completely 
generalize the measurements across facets such as people, items, times, 
raters, test forms, and other conditions of testing. It makes social or 
behavioural measurement as scientific as physical measurement. Application 
of IRT is one of the recent advances in CR achievement measurement. 


Lord (1980) proposed a twelve-step algorithm to design a 
mastery test of a unidimensional skill when erroneous acceptance and 
erroneous rejection of examinees are about equally important. He also proposed 
another algorithm to design a mastery test that takes the relative importance 
of the decision errors into account. His algorithms were based on the 
three-parameter logistic model of IRT for dichotomous items. 


We have already noted that van der Linden (1981) has given a 
latent trait or item response theoretic look at pretest-posttest validation of 
CRT items. He proposed to replace pretest-posttest indices by an evaluation 
of the item information function for the mastery score. 


Hambleton (1983) used IRT models for obtaining accurate 
examinee domain score estimates and for increasing the probability with 
which examinees are assigned correctly to mastery states with CRT scores. 
He compared one-, two-, and three-parameter logistic test models for 
estimating domain scores and making mastery/nonmastery decisions. The 
one- and three-parameter models gave highly comparable results for 
middle and high ability examinees, while for low ability examinees, the more 
general model always performed somewhat better. 
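
The link between an ability estimate and a domain score under a logistic test model can be sketched as follows. The example assumes that the domain score is taken as the average probability of a correct response over the items in the domain, and it contrasts a one-parameter with a three-parameter description of the same hypothetical items. It is an illustration under these assumptions only; the item parameters and function names are hypothetical.

```python
import math


def p_correct(theta, a=1.0, b=0.0, c=0.0, d=1.7):
    """Logistic item response function.

    With the defaults a = 1 and c = 0 this is the one-parameter model;
    supplying a and c gives the three-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def domain_score(theta, items):
    """Expected proportion correct over the items at ability theta."""
    return sum(p_correct(theta, **item) for item in items) / len(items)


# Hypothetical items described under the three-parameter model, and the
# same items described by difficulty alone.
items_3pl = [
    {"a": 1.2, "b": -0.5, "c": 0.20},
    {"a": 0.8, "b": 0.0, "c": 0.25},
    {"a": 1.5, "b": 0.7, "c": 0.20},
]
items_1pl = [{"b": item["b"]} for item in items_3pl]

for theta in (-2.0, 0.0, 2.0):
    one = domain_score(theta, items_1pl)
    three = domain_score(theta, items_3pl)
    print(f"theta = {theta:4.1f}  1PL = {one:.3f}  3PL = {three:.3f}")
```

In this illustration the two descriptions differ most at low ability, where the guessing parameter keeps the expected proportion correct well above zero.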


As mentioned earlier in this review, Hambleton and de Gruijter 
(1983) have successfully demonstrated the theoretical advantages of optimal 
item selection based on IRT models over one of the more common 
alternative strategies, random item selection. 


Thus, researchers who have applied latent trait theory or IRT to CRM, 
such as van der Linden (1986), have tried to resolve the problems of test 
design, item analysis, item selection and standard setting. 


6. DEVELOPMENT AND VALIDATION OF CRTS 


Some researchers have developed and validated CRTs in different 
content areas for different grade levels. In countries like the U.S.A., many 
publishers have made CRTs available commercially. 


Hambleton and Eignor (Cited in Hambleton, 1982) evaluated 11 of 
the more popular commercially available CRTs using an evaluation system. 
They concluded that there was substantial room for improvement in the 
preparation of commercially available CRTs and in reporting of content and 
technical information. The CRTs under their evaluation were 
Objectives-referenced tests, since the tests were developed from 
behavioural objectives. 


Bhogayata (1986) integratively reviewed 45 CRTs described and 
primarily reviewed in the Eighth Mental Measurements Yearbook (MMY). He 
reviewed these tests for their psychometric properties. He found that the 
tests labelled as CRTs in the Eighth MMY were not true CRTs, but were 
only "slogan" CRTs. These tests were not technically sound in the 
psychometric dimensions of domain specification, standard setting, reliability 
and validity. 


Smith, Smith and Brink (1977) developed and validated CRTs titled 
the Standard Achievement Recording System (STARS) for the 
measurement of first to sixth graders. These tests were developed from 
well-defined domains. The items were selected randomly from the domains. 
STARS were consistent for the decisions of classification of examinees as 
masters or non-masters. The tests were internally valid measures. The main 
limitations of STARS were: (1) the standard of 100 percent for mastery 
was not set procedurally; (2) the test items were not logically and empirically 
reviewed for item-objective congruence and instructional sensitivity, 
respectively; and (3) evidence for construct validity was not reported. 


Verma (1984) developed and validated a CRT of fifty Tatsam
suffixes in Hindi as a part of his doctoral thesis. The test items were
empirically reviewed for difficulty and instructional sensitivity. The test items
were reviewed by three subject matter experts, but no quantified results of
this review were reported. Decision-making reliability was established.
The pretest to posttest gain supported the content validity of the test.
The mastery level was set at an 85 percent achievement standard. The main
limitations of Verma's test were: (1) the standard setting at 85 percent for
mastery was not procedural; (2) the test items were not generated from a
well-defined domain; and (3) estimated validities of the test items were
not reported.


It can be concluded from this review of developed and validated tests
that the development and validation of a true CRT is still a
challenging task, like breaking Lord Shiva's bow!


7. CONCLUDING REMARKS


In the foregoing review, six major areas of research on CRM
were covered. Some researchers have investigated other areas of CRM:
criterion-related validation of CRTs (Tindal et al., 1985); estimation of
mastery states (Hambleton et al., 1976); and test length (Millman, 1973). The
reviewer could not find a sufficient number of investigations in these
areas, so they were not included in the present review.


The message of the present review is that there are still many
unresolved problems in the field of CRM awaiting and inviting tremendous
research effort.

An Empirical Study of Development
of Criterion Referenced Tests at the
Primary Stage
Pritam Singh, Kamla Menon, J.P. Shourie 

ABSTRACT 


An empirical study was undertaken to develop and use
criterion-referenced tests to identify mastery levels of students
at the primary stage in schools of Delhi. Using the concept
analysis technique for domain specification in Environmental
Studies (Science) at the class III level, an attempt was made to
develop criterion-referenced tests.
The intended learning outcomes were used as the basis for
item generation and two parallel forms of the test were
developed. Both judgemental and empirical techniques were
used for review of domain description and item-objective
(I.L.O.) congruence. This was followed by a field tryout to
establish the test quality by determining the validity and
reliability of the test using the C.R. approach.

It was clearly seen that the I.L.O. based development of
criterion-referenced tests is quite functional. Both classical
and CR-based techniques of establishing test quality function
equally well on these tests, the former for judging the quality of
the test and the latter for judging the effectiveness of
instruction.

The actual mastery level in selected public schools,
Kendriya Vidyalayas and Corporation schools of Delhi, as also
the gaps in learning, were studied and the results arrived at
provide significant clues to improvement of students' learning.
The study indicates that where instruction is more activity
oriented and the examples used are mostly concrete, the
students' performance is better. Further, 95% mastery is
achievable if instructional intervention is deliberately planned
using learning outcomes as the basis and remedial measures
are adopted after diagnosing the students' inadequacies in
learning various concepts. It is interesting to find that the number
of masters is higher among students from Corporation
Schools than among those from Public Schools and
Kendriya Vidyalayas.


1. INTRODUCTION : 

The National Policy of Education underlines the importance of
grading pupils so as to ensure improvement of learning. Today hardly
20-25% of learners achieve about 75% attainment of the intended level; the
remaining 75 to 80% of pupils are considered qualified even if their
actual learning is merely 30% of the intended learning. It is in such a
situation that criterion-referenced tests can be used to ensure that all
the intended learning outcomes are attained at about the 80% level by most
of the learners. There is a dire need to ensure that tests for such an
assessment are readily available to the teachers.


A project was undertaken to develop CR Tests for class-III in
Environmental Studies (Science). The construction of these tests was
specifically to verify the domain based test development techniques.
Further, these tests were tried out in different types of schools, viz.
Corporation Schools, Central Schools and Public Schools of Delhi, to
ascertain the mastery levels of students and the actual attainment
compared to the expected achievement in the subject of Environmental
Studies (Science) at the class-III level.


2. RELATED STUDIES 


There has been considerable effort to develop criterion
referenced testing parameters comparable to norm-referenced testing.
Given the relatively short period of developments in C.R. Testing, there
has been considerable research in domain specification techniques,
item development strategies and identifying test qualities based on the
criterion referenced approach.


One of the developments in the area of content specification, as
proposed by Popham (1974) and Shoemaker (1975), has been that the
clearer the definition of competency, the easier it is to ascertain the
mastery levels. Since Lindquist (1950), the methods indicated for content
specification have reflected the use of hierarchies of content
elements (White 1974), structural systems (Scandura 1974) and the critical
incident technique (Watson 1983). Another approach is to use objectives as
the basis for integrating content with behavioural outcomes, known as the
domain specification strategy.


The item generation techniques based on the definition of domain
are many, depending on their utility for computer based techniques. In all
of these item generating methods it is assumed that the domain is infinite,
and hence several approaches are found for calculating reliability.


Test construction requires the definition of test length and
the methods of analysing the quality of the test. The studies undertaken
in this area indicate that in C.R. Testing, short tests with homogeneous
items serve the purpose. The level of discrimination varies in inverse
proportion to test length. Item analysis techniques discussed by
Popham and Husek (1972) reject the classical methods of test analysis
and emphasise the mastery learning approach. Anladyra and Gale
(1981) emphasise the role of instruction and suggest the need to
calculate test characteristics on this basis. Item-objective congruence has
been recommended for finding quality. The quality of test items has
been studied from many angles, particularly examinee and instructor
evaluation (Hanson, Memoriss and Bailey 1986; Harrison and Subkoviak
1986), pre-test and post-test item quality, and agreement coefficients of
different qualities.


Studies related to the reliability and validity of C.R. Tests indicate
the use of collateral data to find domain-item congruence and the
correctness of estimation of the domain specification (Ebel 1962;
Cronbach 1971). Reliability studies are based on the context in which
the reliability estimates are made. The most researched area is the
estimation of reliability from a single administration (Wilcox; Huynh
1976). The aspects covered include finding the reliability of sub-scales
where different scores are identified for different parts of the test.
Another area where considerable research has been done is the
appropriateness of traditional reliability indices for CR test
reliability estimates.


The definition of the criterion cut-off score and the correct
estimation of performance, where there is always a considerable chance of
misclassification, are of considerable importance. The state model
assumes 0-1 marking, whereas the continuum model assumes
continuous, differing levels ranging from the lowest level of mastery to
complete mastery. Decision theoretic models have compared the
estimates of different approaches to the classification of masters and
found that the Bayesian methods produce the best results (Berk 1986).


In the light of the above mentioned background in the area of
research on different aspects of Criterion-Referenced Testing, a project
was undertaken to develop C.R. Tests. The focus was on identifying hard
spots in the learning of a concept based subject and determining the mastery
level. Another aspect of the study was to compare the performance of
students from different types of schools in terms of the number of
masters and the level of mastery attained in a selected area of
Environmental Studies (Science) prescribed by NCERT for Class-III.


3. OBJECTIVES OF THE STUDY 
The objectives of the study to develop criterion referenced tests
at the primary stage were:


3.1 To develop a criterion referenced test in a concept based
subject.

3.2 To identify hard spots in the learning of different areas of this
subject.


3.3 To identify the mastery level in terms of concepts learnt by
students from different types of schools.

3.4 To compare the performance of students from different types
of schools in terms of masters and level of mastery in the
subject.

The tests were developed in Environmental
Studies (Science), and schools run by the Government, Kendriya Vidyalayas
and private schools were selected for tryout. In each type of school 900
students were given the test on which the analysis reported was based.


4. LIMITATION OF THE STUDY: 

The objectives of the study having been defined, the present
study was limited to the subject of Environmental Studies (Science) text
prescribed for class-III by NCERT and restricted to the multiple choice
type with 4 alternatives. Only 20-30 items were included in each of the
tests based on a domain. The tests were tried out in three types of
schools, viz. the Un-aided Public Schools, the Kendriya Vidyalayas and
Municipal Corporation Schools situated in urban areas of Delhi.


5. THE RESEARCH QUESTIONS 
Keeping in view the objectives of the study, the following research
questions were identified.
5.1 What is the difference in the mastery level of students from
the three types of schools?


5.2 What are the gaps between intended learning outcomes and
observed outcomes of learning?


5.3 Which are the concepts that are mastered by most of the high
achievers?


5.4 Which are the concepts that are mastered by most of the
students?

6.0 PROCEDURE ADOPTED


6.1 Domain Identification and Description 

The main consideration while determining and defining the unit
domains was to establish their sequence with the remaining units and
the internal cohesiveness of the concepts. The domains were then
analysed for identification of concepts based on their amenability to
teaching in class-III and their testability through a written test.
Corresponding to each concept the intended learning outcomes (I.L.Os)
were formulated, and both the list of concepts and the intended learning
outcomes were reviewed using teachers' judgements and rating of
objective-domain congruence by the expert.


6.2 Item Generation and Editing 

The item generation technique used was construction of
concept based items corresponding to intended learning outcomes. Two
forms of the items were developed for each I.L.O. The tests were
written in Hindi and English simultaneously, and there was an attempt to
see that the context and content of each question were the same in
both versions. Modifications were made in either version when the
translation was not equivalent. Item-I.L.O.
congruence was ensured through experts' judgements, and two forms of
each of 20 tests were developed and finalised.


6.3 Pilot Tryout of tests 

The pilot tryout of the tests was undertaken in 13 schools (7
Public, 4 Kendriya Vidyalayas and 2 Corporation Schools) on a sample
of 300 students. The item analysis helped to identify and improve
defective items to make the tests more reliable. Final tests were then
printed and tried out on the larger sample of 26 schools, of which 4 were
Public Schools, Kendriya Vidyalayas and 12 Municipal Corporation
Delhi schools, comprising 802, 794 and 942 students respectively. In all,
2538 students were involved in the final tryout of tests.


6.4 Analysis of Results 

The final analysis of the test results was done to find the
reliability and validity of the tests. Item analysis was undertaken to
identify the objectives (I.L.Os) which were mastered and partially
mastered, thereby providing evidence on the efficacy of objectives for
class-III.

7.0 DISCUSSION OF RESULTS


7.1 Parallel Form Coefficients of Correlation

The quality of the criterion-referenced tests was assessed after
the final tryout. In order to establish parallelism of the test forms A and
B, the coefficients of correlation between the two parallel forms of each
test were computed for all the tests. Of the 20 tests, 14 had a correlation
coefficient varying from 0.5 to 0.7 while 6 tests had a correlation
coefficient of 0.7 and above. Low correlation between forms A and B
could be due to the short test length and the practice effect on the
performance of students on Form-B.
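
As an illustration only, the parallel-form coefficient described above
can be sketched as a Pearson product-moment correlation between each
student's total scores on Form A and Form B. The study does not state
which correlation formula was used, so the choice of Pearson's r, the
function name pearson_r and the score lists below are all assumptions
made for the sake of the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical total scores of the same six students on the two forms
form_a = [18, 22, 15, 20, 24, 12]
form_b = [19, 21, 17, 22, 23, 14]
r_ab = pearson_r(form_a, form_b)
```

With short tests the total scores take few distinct values, which is
one reason the coefficient can come out lower than on longer tests.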


7.2 Agreement Coefficient : 

Test reliability using the classical KR-21 index varies from
0.63 to 0.88, which is satisfactory for classroom tests. The
criterion-referenced index of reliability, i.e. the Agreement Coefficient,
ranges from 0.60 to 0.85 in the various tests, which is again satisfactory.
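
For readers unfamiliar with the KR-21 index mentioned above, a minimal
sketch of the standard Kuder-Richardson formula 21 follows. It needs
only the number of items, the mean and the variance of total scores.
The use of the population variance, the function name kr21 and the
sample scores are assumptions of this sketch; the study does not
report its computational details.

```python
def kr21(scores, n_items):
    """Kuder-Richardson formula 21 from total scores on an n_items test.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    where k is the number of items, M the mean total score and
    s^2 the variance of total scores.
    """
    if n_items < 2:
        raise ValueError("KR-21 needs at least 2 items")
    n = len(scores)
    mean = sum(scores) / n
    # Population variance (dividing by n) is an assumption of this sketch
    var = sum((s - mean) ** 2 for s in scores) / n
    if var == 0:
        raise ValueError("all scores are identical; KR-21 is undefined")
    k = n_items
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

# Hypothetical total scores on a 25-item test
example = kr21([5, 10, 15, 20, 25], n_items=25)
```

KR-21 assumes items of roughly equal difficulty, so it tends to give a
lower value than KR-20 when item difficulties differ widely.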


7.3 Kappa Index 


The use of the Tuckman item quality checklist and expert
judgement for establishing item-objective congruence has proved
useful. The criterion referenced Kappa index of gain in consistency
indicates that nine out of the twenty tests had low scores, which is
probably due to inadequate instruction.
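
The agreement coefficient and the Kappa index referred to above can be
sketched as follows, assuming they were computed from master and
non-master decisions on the two parallel forms. The agreement
coefficient is taken as the proportion of students classified the same
way on both forms, and Kappa as that proportion corrected for chance
agreement. The 80 percent cut-off, the function names and the scores
are hypothetical and serve only as an example.

```python
def classify(scores, n_items, cutoff=0.80):
    """Return True (master) or False (non-master) for each total score.

    The 0.80 cut-off is an assumed value for illustration.
    """
    return [s / n_items >= cutoff for s in scores]

def agreement_and_kappa(masters_a, masters_b):
    """Agreement coefficient p0 and kappa for two sets of mastery decisions."""
    n = len(masters_a)
    if n == 0 or n != len(masters_b):
        raise ValueError("need two equal-length, non-empty decision lists")
    # p0: proportion of students classified consistently on both forms
    p0 = sum(a == b for a, b in zip(masters_a, masters_b)) / n
    # pc: agreement expected by chance from the marginal proportions
    pa = sum(masters_a) / n
    pb = sum(masters_b) / n
    pc = pa * pb + (1 - pa) * (1 - pb)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else float("nan")
    return p0, kappa

# Hypothetical scores of four students on two 25-item forms
decisions_a = classify([22, 21, 15, 10], n_items=25)
decisions_b = classify([23, 18, 14, 12], n_items=25)
p0, kappa = agreement_and_kappa(decisions_a, decisions_b)
```

Kappa can be low even when p0 looks acceptable, because a large share
of the raw agreement may arise from chance when nearly all students
fall on one side of the cut-off.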


7.4 The item quality was analysed keeping in view the item
difficulty and discrimination indices. Out of the 25 items included in a
test, on an average each test had 13 items with a discrimination index
of 0.5 and above, and the same number of items had a difficulty level of
40-85 per cent. Since the tests were meant to identify mastery level, the
items which were mastered by all the high achievers were retained.
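
The item indices discussed above can be illustrated with a short
sketch. Difficulty is taken here as the percentage of students
answering an item correctly, and discrimination as the difference
between the proportions correct in the upper and lower scoring groups.
The study does not say which discrimination index or group size it
used; the upper-lower method, the 27 percent default, the function
names and the data are assumptions.

```python
def item_stats(responses, group_fraction=0.27):
    """Per-item (difficulty %, discrimination index) from 0/1 responses.

    responses: one list of 0/1 item scores per student.
    group_fraction: share of students in the upper and lower groups
    (0.27 is a common convention, assumed here).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first
    ranked = sorted(responses, key=sum, reverse=True)
    g = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:g], ranked[-g:]
    stats = []
    for i in range(n_items):
        difficulty = 100 * sum(r[i] for r in responses) / n_students
        disc = (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / g
        stats.append((difficulty, disc))
    return stats

def acceptable(stats, min_disc=0.5, low=40, high=85):
    """Indices of items meeting the thresholds quoted in the text."""
    return [i for i, (p, d) in enumerate(stats)
            if d >= min_disc and low <= p <= high]

# Hypothetical responses of four students to three items
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_stats(responses, group_fraction=0.5)
good_items = acceptable(stats)
```

In a mastery test, an item answered correctly by all high achievers
may show low discrimination after effective instruction, which is
consistent with retaining such items as the text describes.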


8.0 FINAL TRYOUT 


The test quality having been established, these tests were tried
out on a sample of 100 students each from the 3 types of schools
(Public, KVS and Govt.) in the U.T. of Delhi. Following are the
conclusions of the analysis of pupils' performance on the C.R. tests for
Class III in Environmental Studies (Science).


In the schools which were used for tryout, the tests were
administered after instruction by the teacher and the tests were given on
the same day. The first test and second test were parallel, so there could
be some practice effect in the case of Form-B of the test, administered
after Form-A.


8.1 Attainment of Mastery Levels 


(a) Preliminary try-out of the test on one unit revealed that the
    Application objective was too difficult to include for students
    of the class-III level. As such, this objective was not considered
    for formulating the intended learning outcomes corresponding
    to various concepts.

(b) Out of the entire sample, 17% of the students attained the
    mastery level: 9% from M.C.D. schools, 5% from the Public
    schools and 3% from Kendriya Vidyalayas.

(c) Overall, the best performance of students was from the M.C.D.
    schools, where the difference in mean score of masters and
    non-masters is the least. It indicates better instructional
    efforts for weak students in these schools.

(d) The master groups in all three types of schools show a similar
    mean achievement, which supports the view that the mastery level
    identified is attainable provided the instruction is I.L.O. based.


8.2 Concept Attainment Level 


(a) Analysis of pupils' responses indicates that by and large most
    of the abilities implied by knowledge and understanding
    objectives are not attained at the intended mastery level.

(b) The hypothesised hierarchy of abilities in terms of the N.C.E.R.T.
    taxonomy was found substantially changed. It reflected order
    for different abilities as ability to thereby indicating the
    emphasis given on various abilities at present in our

(c) Four domains, viz. 'Soils and Crops', 'Dissolving property of
    Liquids', 'Housing and Clothing' and 'The Earth', were found to
    be very difficult for the entire group, which indicates that the
    concepts included under these domains are too difficult to be
    understood by the students at this level. Therefore, the need
    for reconsideration of these concepts for inclusion in the class-III
    syllabus cannot be over-emphasised.

(d) This study has shown that using the intended learning outcomes
    for domain specification is a valid and effective approach for
    construction of criterion-referenced tests. The varying
    emphasis given during instruction, as revealed in the
    performance measures of different abilities, further supports
    the need to use objective-based instruction and remedial
    teaching for developing mastery of concepts to achieve the
    intended learning outcomes at the desired level of intended
    performance.


9.0 IMPLICATIONS OF THE STUDY 

The introduction of criterion-referenced tests in the subject of
Environmental Studies, particularly at the primary stage, is a big
departure from traditional norm-referenced testing. Here pupils'
performance is valued against an intended performance standard and not
against class performance. The focus is not on pupil comparison but on
improvement of pupils' achievement. These criterion referenced tests
help in setting standards in a subject for a class, thereby focussing both
learners' and teachers' attention on improvement of teaching and
learning rather than merely passing judgement on pupils' performance.
Having a reference of content of learning in terms of the criterion of I.L.Os
helps teachers to identify students' weaknesses and concentrate on
remedial action. This enables the teachers to remain cognizant of the
goal and standard to be reached so that clear evidence of attainment in
terms of mastery is recognizable. In the school curriculum graded
placement of concepts is possible in a subject, and instruction can be
effectively planned at each stage to achieve the intended learning.


The philosophy underlying this approach to testing rests on the
improvement of learning by setting the desired performance standard,
adapting teaching-learning strategies to reduce wastage and stagnation
through remedial instruction, and maximising the number of students who
attain the intended mastery level of the concepts. This C.R. approach to
teaching serves both objectives of excellence and equality of learning.